Preface



In recent years the Internet has seen a tremendous growth in terms of amount 
of traffic and number of users. At present, the primary technical objective is 
to provide advanced IP networking services to support an evolving set of new 
applications like, for example, IP-telephony, video teleconferencing, Web TV, 
multimedia retrieval, remote access to and control of laboratory equipment and 
instruments. To achieve this objective, the Internet must move from a best- 
effort paradigm without service differentiation to one where specific Quality of 
Service (QoS) requirements can be met for different service classes in terms of 
bandwidth, delay, and delay jitter. Another crucial goal is the ubiquitous deploy- 
ment of advanced services over the global network resulting from the effective 
integration of wireless and satellite segments into the future Internet. 

Although many researchers and engineers around the world have been work- 
ing on these challenging issues, several problems remain open. 

The Tyrrhenian International Workshop on Digital Communications 2001, 
which focused on the Evolutionary Trends of the Internet, was conceived as a 
highly selective forum aimed at covering diverse aspects of the next generation 
of IP networks. 

The members of the Technical Program Committee of the workshop concen- 
trated their efforts on identifying a set of topics that, although far from being 
exhaustive, provide a sufficiently wide coverage of the current research challenges 
in the field. 

Eight major areas were envisioned, namely WDM Technologies for the Next 
Generation Internet, Mobile and Wireless Internet Access, QoS in the Next Gen- 
eration Internet, Multicast and Routing in IP Networks, Multimedia Services 
over the Internet, Performance Modeling and Measurement of Internet Proto- 
cols, Dynamic Service Management, Source Encoding and Internet Applications. 

With the invaluable help of the session organizers, 46 papers, partly invited 
and partly selected on an open-call basis, were collected for presentation at 
the workshop and publication in this book. We believe that the contributions 
contained in these proceedings represent a timely and high-quality outlook on 
the state of the art of research in the field of multiservice IP networks, and we 
hope they may be of use for further investigation in this challenging area. 
Progressive Introduction of Optical Packet
Switching Techniques in WDM Networks 

Marc Vandenhoute Francesco Masetti Amaury Jourdan 
and Dominique Chiaroni 

(1) Alcatel USA, 1201 E. Campbell Road, Richardson, Texas 75081-1936, USA 
(2) Alcatel CIT, Route de Nozay, F-91460 Marcoussis, France 



Abstract. The transport network, mainly based on optical infrastructure, is seeing a
traffic increase, which introduces new requirements and challenges. This paper
provides a summary of the trends that will bring bandwidth optimisation in 
WDM core networks, and will thus require the progressive introduction of 
optical packet switching techniques. 



1 Introduction 

The transport network is currently experiencing a doubling of demand every 8-15
months. In the view of most operators, the predominance of data traffic over voice is
already a reality, and data is expected to represent 90% of traffic within three years.
Optimising the network at all levels (core and metropolitan, figure 1) for data
applications and the underlying IP support protocol is definitely the trend in the next
generation of routing and switching products being prepared by all vendors.







Fig. 1. Transport network schematics 
While WDM transmission is clearly expected to meet these capacity requirements, the
situation is not yet settled for “layer-1”, “layer-2” and “layer-3” routing and
switching technologies. Hence the efforts to develop a new generation of networking
products that put forward scalability and flexibility as the most critical specifications
and try to avoid complex protocol stacks featuring replication of functionality between
different layers.

This paper will discuss how "fast" optical switching, adopting packet-based 
techniques, can improve network organisation and bandwidth efficiency in WDM 
networks, thanks to unique scalability and modularity features, and the recent 
maturity of the technology. In this discussion, we focus on the evolution of the core 
network; metropolitan network architectures will typically be subject to partially 
different requirements, as discussed in [8]. 



2 Evolution Trends 

Put in an historical perspective, we can identify three major trends in optical 
internetworking which will clarify this functional split. 



2.1 Switching Paradigm 

A systematic pattern can be identified over the last 30 years of networking where the 
dominant switching paradigm (in high-speed networks) has been cycling back and 
forth between circuit switching and packet/cell switching technologies. 

An earlier manifestation of this pattern was the emergence of Asynchronous 
Transfer Mode (ATM) cell switching as a more flexible and efficient alternative to 
Time Division Multiplexing (TDM). While ATM did bring a more flexible switching
granularity, it (initially¹) did retain a strict connection-oriented forwarding paradigm,
in the form of the establishment of Virtual Paths and Connections. Eventually, 
another packet switching technology based on the Internet protocol (IP) would prove 
to be the more ubiquitous service platform. Its native connection-less switching 
paradigm would eventually be complemented by a connection-oriented mode, in the 
form of (Generalised) Multi-Protocol Label Switching (GMPLS). 

A determining factor in this evolution has been the progress in forwarding and 
switching technology. Indeed, an interesting episode in this evolution occurred when 
an innovative solution to the perceived lack of forwarding performance of software- 
based routers (e.g. early versions of Cisco’s 7500 series) was introduced with the 
concept of “IP/ATM shortcut routing”.

The idea of applying high performance hardware-based ATM switching engines 
was introduced with Ipsilon’s Flow Management Protocol (FMP) [12] and Cisco’s 
Tag Switch Routing (TSR) [13]. A slightly different approach, with separate control 
planes for the ATM and IP layers, was proposed by the ATM Forum in the form of 
Multi-Protocol Over ATM (MPOA) [14]. The common idea behind all of them was 
the establishment of edge-to-edge transparent circuits (in casu, ATM VP/VCs) under 
the control of a native IP control layer. 



¹ a connection-less variant of ATM was proposed, but never deployed
Eventually, of course, the emergence of native high performance IP-oriented
network processors would obviate the need for ATM cell switching and allow the 
building of the gigabit-class IP/MPLS routers we find on the market today. 

Interestingly, we can identify a lot of parallels with the current IP/optical network 
evolution: 

• GMPLS is the native IP incarnation of a generic UNI/NNI signalling mechanism 
which we use today to establish edge-to-edge (wavelength) paths in WDM 
optical networks. This protocol evolved from a connection-oriented path 
establishment mechanism targeted at the IP client layer, to a generic signalling 
mechanism for the widest possible variety of connection-oriented transport 
technologies, including SDH/SONET, G.709-based and, more generally, DWDM 
networks [9][11].

• The current optical wavelength switching (OWS) models call for the signalled
establishment of shortcut lightpaths between edge routers. This form of switching
looks like the only alternative today to achieve switching capacities beyond the
capabilities of current routers: optical cross-connects (OXC) with capacities in
the tens of terabits per second are achievable today, well beyond the capabilities
of the current generation of routers.

• Top-end network processors (10 Gbps arriving now) seem to lag one generation
behind top-end wavelength switching subsystems (40 Gbps starting to be
deployed). However, market pressure makes it very unlikely that deployment of
40 (and later 80 or 160) Gbps transmission systems will be artificially slowed
down to match the port capacity of high-end routers. This essentially mandates,
for the foreseeable future, a network architecture where electronic processing
nodes remain positioned at the edge of an optically switched/routed core network
and some form of optical switching is mandated for the core nodes.

• In the design of switching fabrics for individual routers, hybrid approaches are 
emerging with the introduction of optical (frame) switching elements to alleviate 
the scalability problems of current electronic switching fabrics. These hybrid 
network elements are a clear foreboding of a more fundamental shift to end-to- 
end optical switching, just as ATM HW switching elements were the foreboding 
of the current generation of network processors. 

• The fundamentally different transfer modes of the IP layer and the underlying
transport layer (ATM cells in the former case, optical frames in the latter) introduce
some level of opaqueness, resulting in sub-optimal solutions with respect to
multiplexing, framing, QoS control, ... In the case of IP/ATM, the segmentation to
which IP packets were subjected resulted in awkward AAL5-frame-aware cell
processing in ATM switches [16]. In the all-optical domain, the absence of any
processing capabilities on optical headers makes it more difficult to perform basic
processes such as congestion control (selective frame dropping), loop prevention
(e.g. through header Time-To-Live decrementing), diffserv-like QoS
marking/remarking, ...

• The differences between the overlay, augmented and peer models currently 
investigated in the IETF IPO group [9] are not dissimilar to the differences 
between the MPOA (overlay) and FMP/TSR (integrated/peer) approaches in 
IP/ATM shortcut routing. They relate to the different degrees of integration 
between the routing layers in the underlying transport and IP client layer. 
Fig. 2. Role of “shortcutting” in the router capacity evolution



Analysing the analogy further, however, we can also identify key differences:

• IP over ATM shortcutting involved the mapping of two statistical
multiplexing network layers (IP and ATM), which were themselves layered on
top of a deterministic multiplexing layer (SDH/SONET). Some simplification in
the protocol stacks has occurred since, with more direct IP/WDM mapping [8].
However, this does not imply the absence of any conflicts between control 
mechanisms at different layers, as exemplified by the continued debate around 
the interworking of survivability mechanisms (protection, restoration, ...) at the 
IP, SDH/SONET and wavelength layer. 

• The (virtual) label space in IP/ATM shortcutting was quite large (up to 64K
channels). However, in the case of a wavelength switched transport network, the 
GMPLS label space will have to be mapped to physical wavelengths and, hence, 
is unlikely to exceed a few hundred for an individual link in the foreseeable 
future. 

• GMPLS-controlled lightpaths (and also SDH/SONET trails) are physical
channels with a fixed capacity, whereas ATM VP/VCs could be virtual 
connections with adjustable capacity. The notion of flow and/or burst switching 
can be seen as the link between the two worlds, introducing also the need to 
identify the appropriate switching granularity according to network requirements. 

Summarizing, from our experience with IP/ATM shortcutting we can draw some
insights on our current IP/WDM deployment efforts:

• IP/ATM deployment scenarios typically resulted in a de facto establishment of a
full mesh of channels between edges. Signalling latencies, limited switching
capacity of edge routers (relative to the core OXCs) and the topological diversity
of traffic seen in core networks make it unlikely that this will be any different in
optical wavelength switched networks: a full mesh of lightpaths for default
connectivity will likely be complemented by additional on-demand lightpaths. As
a consequence, the control scalability problems resulting from the full-mesh
connectivity encountered in IP/ATM overlays will also occur in IP/WDM overlay
networks. These problems relate mainly to the overhead of routing peer
adjacencies over fully meshed networks, as documented in [15].

• Furthermore, the (physical) label space may become a particular concern for
IP/optical networks, since it will be a few orders of magnitude smaller than in
earlier IP/ATM deployment scenarios. Taking into account additional
requirements on the label/wavelength space for features such as survivability,
bandwidth-on-demand, QoS schemes², traffic isolation (e.g. for VPNs), ... it
becomes increasingly clear that the number of nodes in a fully meshed optically
switched network may quickly become limited by the label space necessary to
interconnect them, making it unlikely that large combined metro/core networks
will be built using a flat transparent topology.

• This problem will be compounded by the fact that the capacity of the lightpaths
between edge nodes will closely follow optical technology trends. In essence, in
an expanding OWS network, both the large number of lightpaths from an edge 
node (driven by a need for full mesh connectivity, determined by the increasing 
number of other edge nodes) AND the fixed capacity of those lightpaths (10 
Gbps -> 40 Gbps -> 80 Gbps -> 160 Gbps -> ...) will increase significantly. This 
“high lightpath count / fat fixed pipes” combination may result in very poor 
efficiency overall, due to lack of bandwidth flexibility and capacity 
manageability. 
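
The scaling argument in the three points above can be made concrete with a little arithmetic. The sketch below is only an illustration under simplifying assumptions that are ours, not part of the original analysis: every edge node attaches through a single link, each lightpath consumes one wavelength on that link, traffic is spread uniformly over the mesh, and all function names are hypothetical.

```python
def full_mesh_lightpaths(num_edge_nodes: int) -> int:
    """Bidirectional lightpaths needed to fully mesh the edge nodes."""
    n = num_edge_nodes
    return n * (n - 1) // 2


def lightpaths_per_node(num_edge_nodes: int, classes_of_service: int = 1) -> int:
    """Lightpaths terminating at one edge node.

    Assumes one lightpath per peer and per class of service, as when each
    service class is given its own edge-to-edge lightpath.
    """
    return (num_edge_nodes - 1) * classes_of_service


def max_meshed_nodes(wavelengths_per_link: int, classes_of_service: int = 1) -> int:
    """Largest full mesh that fits the label (wavelength) space of one link.

    Assumption: the node's single access link must carry all its lightpaths,
    so (N - 1) * classes_of_service <= wavelengths_per_link.
    """
    return wavelengths_per_link // classes_of_service + 1


def mean_lightpath_utilisation(node_traffic_gbps: float,
                               num_edge_nodes: int,
                               lightpath_gbps: float) -> float:
    """Average fill of each lightpath if a node's traffic is spread uniformly."""
    return node_traffic_gbps / ((num_edge_nodes - 1) * lightpath_gbps)


if __name__ == "__main__":
    print(full_mesh_lightpaths(50))                   # lightpaths in a 50-node mesh
    print(max_meshed_nodes(160, 4))                   # mesh size limit, 160 wavelengths, 4 classes
    print(mean_lightpath_utilisation(400.0, 41, 40))  # fill of 40 Gbps pipes
```

With these illustrative numbers, a node offering 400 Gbps to a 41-node mesh of 40 Gbps lightpaths fills each pipe to only a quarter of its capacity, which is the "high lightpath count / fat fixed pipes" effect described above.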

These considerations lead us to the conclusion that, in order to tackle some of the 
control and scalability issues associated with OWS, a new network architecture 
should emerge in time with the following features:

• Support for both connection-less and connection-oriented forwarding paradigms, 
reflecting the complementarity currently available in the IP client layer. 
Connection-oriented forwarding should be based on virtual, rather than physical 
path establishment [5]. 

• Fine-granularity optical switching, at the level of individual flows and/or (time)
slots (i.e. fractions of a wavelength), rather than complete wavelengths, allowing 
for the flexible bandwidth management necessary in very high capacity networks. 

• Hybrid switching in the optical and electronic domain: complex forwarding
(incl. multicast & VPN), statistical multiplexing with QoS control, ... decisions on
control packets will continue to be performed in the electronic domain, but
associated (large) payloads should remain in the optical domain in order to
benefit from the bandwidth scalability associated with optical switching. This
aspect will be further investigated in a later chapter.

A number of variants of this novel switching paradigm are currently being 
discussed, with Optical Burst Switching as its most obvious exponent [1][2][3][4][6]. 



2.2 Optical Network Transparency 

An important enabler of this network evolution is the increased interest in building 
transparent optical networks. 



² which, e.g., might mandate the assignment of a separate edge-to-edge lightpath for individual
(a.o. diffserv) classes of service
Optical transparency could bring a number of obvious advantages:

• Maintain the integrity of the light in the optical path to reduce cost: O/E/O
conversions in general and high-speed electronic and opto-electronic components
and subsystems constitute a very important part of the cost of current network
elements.

• Provide some bit-rate scalability: optical switching and transmission elements
are often bit-rate agnostic, hence improving the prospect for future upgradability
of an optical infrastructure. Of course, genuine transparency will largely remain
an elusive target since edge systems will never be truly bit-rate agnostic, and
neither will monitoring subsystems within the optical network.

However, it can be expected that these network elements will not remain an
impediment to building a transparent data path within the optical backbone network.
Edge systems could be incrementally upgraded at the periphery, with several
co-existing generations of edge systems exchanging data at various bit-rates within
the optical transport network. Monitoring systems targeted at specific bit-rates
could coexist and would operate in a transparent (sampling) manner (i.e. attached
via splitters to the optical channel, hence not obstructing the data path).

• Create a new framing transparent to the client protocol (framing agnosticism):
the major benefit of optical network transparency would eventually be its framing
agnosticism, i.e. the capability to have different kinds of client-layer PHY (and
higher-layer) framing co-existing on a single optical infrastructure, even
transparent to the switching elements on the optical path³. This could go as far
as allowing forms of asynchronous optical switching to coexist with more
traditional synchronous forms of transmission, e.g. G.709 framing with
asynchronous burst framing.

Of course, transparent optical networking will remain controversial for the
foreseeable future, as it essentially implies a return to analogue network engineering
rules⁴ and a departure from the generally heralded and clearly demarcated network
model imposed by the Synchronous Digital Hierarchy (SDH/SONET) suite of
specifications.

The future will show whether the potential functional advantages of network 
transparency will weigh up against the obvious operational disadvantages. However, 
we definitely foresee a future where increasing levels of transparency will be 
introduced in optical wavelength switched (OWS) networks. For optical packet
switching, the most challenging level of transparency remains bit-rate
transparency, since existing fast switching technology today imposes pulse
degradations in the time domain that force the use of 3R regenerators. Without
regeneration, the switching throughput of a switch is limited and comes into direct
competition with existing electronic technologies. A reasonable degree of
transparency probably lies in the adoption of a new optical framing (burst), and in
the exploitation of all-optical packet regenerators to advantageously replace costly
O/E/O regenerators.



³ one could imagine part of an optical frame switch, featuring transparent optical switching
elements, operating in a static cross-connect mode for the establishment of long-duration
OWS lightpaths.

⁴ necessitating a.o. the introduction of routing rules based on optical channel impairments
2.3 Separation of Control & Data

A third enabler in optical networking is the evolution towards separation between the 
control plane and the data plane, already alluded to in a previous chapter. 

This evolution can be documented in three steps:

• Classic data network architectures where control information is exchanged
in-band with normal data⁵. This can clearly be observed in current IP networks
where general control (ICMP), routing (OSPF, BGP, ...), signalling (MPLS,
IGMP, ...) and resource management (RSVP, ...) information is exchanged on
the same carrier as normal data, albeit, sometimes, subject to relative
prioritisation⁶.

• OWS network architectures where GMPLS signalling and routing control data
can be exchanged out of band with respect to the data-carrying lightpaths [10].

• Optical packet and burst switching solutions where not only control information, 
but also individual packet/burst header information is transported apart from the 
data payloads on separate wavelengths. 

This separation is inevitable due to the complementarity between the complex 
processing capabilities of electronic control systems and the bit-rate agnostic 
switching capability of emerging optical components. It is unlikely to disappear in the 
foreseeable future, given the state of current research in optical computing. 

The separation of control poses some unique challenges, which are only now being 
investigated. Issues abound, as exemplified here: 

• The correlation between information elements on separate carriers (e.g. a
signalling message and the actual data on its associated lightpath) is not trivial
to observe, since monitoring capabilities on the optical lightpath are limited,
especially if some form of non-intrusive⁷ monitoring is implemented.

• Time-synchronisation between information elements on separate carriers will not 
only be impacted by optical characteristics (e.g. differential fibre propagation 
speeds), but will require particular discipline in the design of control electronics 
(requiring latency control at the ns granularity level) and network layout 
(compensating for optical paths with varying lengths). 
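
To give a feel for the orders of magnitude behind the synchronisation issue, the following sketch estimates the skew between a header and its payload carried on separate wavelengths. It is an illustration only: the chromatic dispersion coefficient (17 ps/(nm·km)) and group index (1.468) are typical textbook values for standard single-mode fibre near 1550 nm that we assume here, and the function names are hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def dispersion_skew_ns(dispersion_ps_per_nm_km: float,
                       length_km: float,
                       wavelength_gap_nm: float) -> float:
    """Differential delay between two wavelengths over the same fibre.

    Uses the first-order relation skew = D * L * delta_lambda,
    converted from picoseconds to nanoseconds.
    """
    return dispersion_ps_per_nm_km * length_km * wavelength_gap_nm / 1000.0


def path_length_skew_ns(length_difference_m: float,
                        group_index: float = 1.468) -> float:
    """Delay caused by a difference in optical path length."""
    return length_difference_m * group_index / SPEED_OF_LIGHT_M_PER_S * 1e9


if __name__ == "__main__":
    # Assumed example: control and data wavelengths 10 nm apart over 100 km.
    print(dispersion_skew_ns(17.0, 100.0, 10.0))
    # Assumed example: one metre of extra fibre on one of the two paths.
    print(path_length_skew_ns(1.0))
```

Under these assumptions both effects are in the nanosecond range (about 17 ns of dispersion skew, and roughly 5 ns per metre of path mismatch), which is consistent with the ns-level latency control called for above.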



3 Optical Burst Switching 

Optical burst switching has the potential to reconcile the optimisation of resource 
utilisation and service differentiation from packet switching, with application of both 
connection-less and (GMPLS-based) connection-oriented forwarding [5] techniques, 
and the scalability of wavelength cross-connects. 

IP routers are interconnected to a layer-1/2 optical system performing both traffic
aggregation and core switching in a much more flexible way than a cross-connect, but 
at a lower cost than an IP router. In this network approach, depicted in figure 3, IP 



⁵ we disregard here management plane information which, in current transport
networks, is already often transported out-of-band
⁶ e.g. through diffserv DSCP marking
⁷ i.e. not involving the O/E/O conversion of the complete data stream
packets are concatenated into larger optical “bursts” at edge nodes, and
are then routed as a single entity through the network, within core optical packet 
routers [2]. 







The advantage of this aggregation is to relax drastically the forwarding speed 
required from core routers, and to scale up their forwarding capability by at least one 
order of magnitude, well within the multi-Tbps range. In addition, with this approach, 
it becomes possible to consider groups of wavelengths (WDM ports) as a single 
resource (of typ. 300-600 Gbps of capacity), and therefore to improve the logical
performance and/or decrease the memory requirements with respect to IP routers with 
single wavelength processing capabilities. In a first step, this may happen within the 
switching fabric, where optics could be used to reach larger throughputs (i.e. 10s of 
Tbps) and where the control and scheduling could become the bottleneck of the 
system. Later, it can be applied to networking concepts. 
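
The relaxation of the forwarding speed can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from a measured system: the line rate, the mean packet size and the mean burst size are assumed values, and the function names are hypothetical.

```python
def forwarding_decisions_per_second(line_rate_bps: float, unit_bytes: int) -> float:
    """Forwarding decisions needed per second on a fully loaded port.

    One decision is taken per switched unit (an IP packet or an optical burst).
    """
    return line_rate_bps / (unit_bytes * 8)


def relaxation_factor(mean_packet_bytes: int, mean_burst_bytes: int) -> float:
    """How many times fewer decisions a core node takes when switching bursts."""
    return mean_burst_bytes / mean_packet_bytes


if __name__ == "__main__":
    line_rate = 10e9          # assumed 10 Gbps port
    packet, burst = 500, 5000  # assumed mean sizes in bytes
    print(forwarding_decisions_per_second(line_rate, packet))
    print(forwarding_decisions_per_second(line_rate, burst))
    print(relaxation_factor(packet, burst))
```

In this example, aggregating ten average packets per burst already yields the order-of-magnitude reduction mentioned above; larger bursts relax the requirement further, at the price of added assembly delay at the edge.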

In addition, this burst-switching principle is very much optics-friendly,
since fast optical switching technology requires some specific framing to avoid the
loss of any payload data. Optical switching also offers the prospect of large
scalability (current research realisations have demonstrated up to 10 Tbps) in a
single-stage configuration, avoiding complex interconnection and improving footprint.
It is also less sensitive to an increase of the line rate, which is very important in
view of the oncoming evolution towards 40 Gbps transmission.

In the long run, the ultimate target could be to reach an all-optical implementation
of such a network, including additional key functions such as buffering and
regeneration, which have been demonstrated in the laboratory using optical devices
(optical fibre delay lines and non-linear optical elements, respectively). However,
depending on cost/performance trade-offs, an intermediate opto-electronic solution
could be introduced first to reduce the time to market.

A more detailed analysis of the current state of research and enabling optical 
technologies for Optical Burst Switching can be found in [2] and [7]. 



4 Conclusions 

Optical wavelength switched networks are today’s answer to the capacity growth 
requirements of Internet backbones. 
However, from an historical perspective, if these IP/WDM shortcut solutions do
offer some relief in the short and medium term, we can only expect that they will also 
be subject to some of the scalability issues experienced by IP/ATM-based shortcut 
solutions. 

Moreover, the gradual introduction of two novel concepts in OWS transport 
networks, optical transparency and separation of data and control, will be essential
milestones. Successful mastery of these new network architectural elements will be a 
condition sine qua non for the introduction of transparent end-to-end optical 
packet/burst switching. 

Optical packet/burst switching technologies are a natural answer to a number of 
concerns with OWS networks, offering desirable benefits in terms of both network
efficiency and control scalability. 
Abstract. As new bandwidth-hungry IP services are demanding more
and more capacity, transport networks are evolving to provide a
reconfigurable optical layer in order to allow fast dynamic allocation of WDM
channels. To achieve this goal, optical packet-switched systems seem to 
be strong candidates as they allow a high degree of statistical resource 
sharing, which leads to an efficient bandwidth utilization. In this work, we 
propose an architecture for optical packet-switched transport networks, 
together with an innovative switching node structure based on the concept
of per-packet wavelength routing. Some simulation results of node
operation are also presented. In these simulations, the node performance 
was tested under three different traffic patterns. 



1 Introduction 

Telecommunication networks are currently experiencing a dramatic increase in
demand for capacity, driven by new bandwidth-hungry IP services. This will lead
to an explosion of the number of wavelengths per fiber, which cannot be easily
handled with conventional electronic switches. To face this challenge, networks
are evolving to provide a reconfigurable optical layer, which can help to relieve 
potential capacity bottlenecks of electronic-switched networks, and to efficiently 
manage the huge bandwidth made available by the deployment of dense wave- 
length division multiplexing (DWDM) systems. 

As current applications of WDM focus on a relatively static usage of single
wavelength channels, much work has been carried out in order to study how
to achieve switching of signals directly in the optical domain, in a way that 
allows fast dynamic allocation of WDM channels, to improve transport network 
performance. Two main alternative strategies have been proposed to reach this 
purpose: optical packet switching P, |2| and optical burst switching 0,0. 

In this article, we first introduce optical packet and burst switching ap- 
proaches. Then an architecture for packet-switched WDM transport networks 
and a novel optical switching node are proposed. Some simulation results of 
node operation, under different traffic patterns, are also presented. 



S. Palazzo (Ed.): IWDC 2001, LNCS 2170, pp. 10-^3 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001
2 Optical Packet and Burst Switching

Optical packet switching allows single wavelength channels to be exploited as shared
resources, with the use of statistical multiplexing of traffic flows, helping to
efficiently manage the huge bandwidth of WDM systems. Several approaches 
have been proposed to this aim PI, p|. 

Most proposed systems carry out header processing and routing functions 
electronically, while the switching of optical packet payloads takes place directly 
in the optical domain. This eliminates the need for many optical-electrical-optical
conversions, which call for the deployment of expensive opto-electronic components,
even though most of the optical components needed to achieve optical
packet switching still remain too immature for commercial deployment.

Optical burst switching aims at overcoming these technological limitations.
The basic units of data transmitted are bursts, made up of multiple packets,
which are sent after control packets, carrying routing information, whose task
is to reserve the necessary resources on the intermediate nodes of the transport
network (see Fig. 1). This results in a lower average processing and synchronization
overhead than optical packet switching, since packet-by-packet operation
is not required. However, packet switching has a higher degree of statistical
resource sharing, which leads to a more efficient bandwidth utilization in a bursty,
IP-like, traffic environment.

[Figure: Source, Intermediate node 1, Intermediate node 2, Destination. T: offset time; S: processing delay]

Fig. 1. The use of an offset time in optical burst switching
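
The role of the offset time can be sketched numerically. The model below is a simplification that we assume for illustration: the control packet and the burst follow the same path with identical propagation delays, the burst cuts through the nodes without delay, and every node that handles the control packet spends a known processing time on it. The function names are hypothetical.

```python
from itertools import accumulate
from typing import List, Sequence


def minimum_offset_time(processing_delays: Sequence[float]) -> float:
    """Smallest offset T so that the burst never overtakes its control packet."""
    return sum(processing_delays)


def slack_per_node(offset: float, processing_delays: Sequence[float]) -> List[float]:
    """Time left at each node between the end of header processing and burst arrival.

    Propagation delays cancel out because both travel the same path, so the
    lead of the control packet shrinks by the processing delay at every node.
    """
    return [offset - done for done in accumulate(processing_delays)]


def burst_is_safe(offset: float, processing_delays: Sequence[float]) -> bool:
    """True if resources are reserved at every node before the burst arrives."""
    return all(slack >= 0 for slack in slack_per_node(offset, processing_delays))


if __name__ == "__main__":
    delays = [10, 10, 10]  # assumed processing delay per node, in microseconds
    print(minimum_offset_time(delays))
    print(slack_per_node(30, delays))
    print(burst_is_safe(20, delays))
```

With an offset shorter than the accumulated processing delay, the burst would reach the last node before its reservation is in place, which is precisely what the offset time is meant to prevent.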

Since optical packet-switching systems still face some technological hurdles,
the existing transport networks will probably evolve through the intermediate
step of burst-switching systems, which represent a balance between circuit and
packet switching, making the latter a longer-term strategy for
network evolution.

In this work, we have focused our attention on optical packet switching, 
since it offers greater flexibility than the other relatively coarse-grained WDM 
techniques, aiming at efficient system bandwidth management. 

3 Optical Transport Network Architecture 

The architecture of the optical transport network we propose consists of $M = 2^m$
optical packet-switching nodes, each denoted by an optical address made of
$m = \log_2 M$ bits, which are linked together in a mesh-like topology. A number
of edge systems (ES) interface the optical transport network with legacy IP
(electronic) networks (see Fig. 2).




Fig. 2. The optical transport network architecture 



An ES receives packets from different electronic networks and performs traffic 
aggregation in order to build optical packets. The optical packet is composed of 
a simple optical header, which comprises the m-bit destination address,
and of an optical payload made of a single IP packet, or, alternatively, of an 
aggregate of IP packets. 
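
As an illustration of the packet format just described, the sketch below builds an optical packet from one or more IP packets. It is a minimal model under our own assumptions: the address is represented as a bit string, the payload is a plain concatenation of the IP packets without delimiters, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class OpticalPacket:
    destination: str  # m-bit destination address, written as a bit string
    payload: bytes    # a single IP packet or an aggregate of IP packets


def address_bits(num_nodes: int) -> int:
    """Return m such that num_nodes = 2**m, as required by the addressing scheme."""
    m = num_nodes.bit_length() - 1
    if num_nodes < 2 or (1 << m) != num_nodes:
        raise ValueError("number of nodes must be a power of two, at least 2")
    return m


def encode_address(node_id: int, num_nodes: int) -> str:
    """Encode a node identifier on exactly m bits."""
    m = address_bits(num_nodes)
    if not 0 <= node_id < num_nodes:
        raise ValueError("node identifier out of range")
    return format(node_id, f"0{m}b")


def build_optical_packet(ip_packets: Sequence[bytes],
                         destination_node: int,
                         num_nodes: int) -> OpticalPacket:
    """Aggregate IP packets bound for the same destination into one optical packet."""
    if not ip_packets:
        raise ValueError("at least one IP packet is required")
    header = encode_address(destination_node, num_nodes)
    return OpticalPacket(destination=header, payload=b"".join(ip_packets))


if __name__ == "__main__":
    packet = build_optical_packet([b"ab", b"cd"], destination_node=5, num_nodes=16)
    print(packet.destination, len(packet.payload))
```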

The optical packets are then buffered and routed through the optical trans- 
port network to reach their destination ES, which delivers the traffic it receives 
to its destination electronic networks. 
At each intermediate node in the transport network, packet headers are
received and electronically processed, in order to provide routing information 
to the control electronics, which will properly configure the node’s resources to 
switch packet payloads directly in the optical domain. 

The transport network operation is asynchronous; that is, packets can be 
received by nodes at any instant, with no time alignment. The internal operation 
of the optical nodes, on the other hand, is synchronous (slotted). In the model 
we propose, the time slot duration, T, is equal to the amount of time needed to 
transmit an optical packet, with a 40-byte payload, from an input WDM
channel to an output WDM channel. 
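
The slot duration follows directly from the packet length and the channel bit rate. The sketch below shows the calculation under assumptions that are ours: the header is transmitted serially at the same bit rate as the payload, an optional guard time may be added, and the 10 Gbps rate used in the example is only illustrative. The function name is hypothetical.

```python
def slot_duration_ns(line_rate_gbps: float,
                     payload_bytes: int = 40,
                     header_bits: int = 0,
                     guard_time_ns: float = 0.0) -> float:
    """Time slot T, in nanoseconds, needed to transmit one optical packet.

    One Gbps equals one bit per nanosecond, so bits / Gbps gives nanoseconds.
    """
    if line_rate_gbps <= 0:
        raise ValueError("line rate must be positive")
    total_bits = payload_bytes * 8 + header_bits
    return total_bits / line_rate_gbps + guard_time_ns


if __name__ == "__main__":
    # Assumed example: 40-byte payload alone on a 10 Gbps channel.
    print(slot_duration_ns(10.0))
    # Same payload with an assumed 4-bit address header (M = 16 nodes).
    print(slot_duration_ns(10.0, header_bits=4))
```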

The operation of the optical nodes is slotted since the behavior of packets, 
in an unslotted node, is less regulated and more unpredictable, resulting in a 
larger contention probability. 

A contention occurs every time that two or more packets are trying to leave 
a switch from the same output port. How contentions are resolved has a great 
influence on network performance. Three main schemes are generally used to 
resolve contention: wavelength conversion, optical buffering and deflection rout- 
ing. 

In a switch node applying wavelength conversion, two packets trying to leave
the switch from the same output port are both transmitted at the same time but
on different wavelengths. Thus, if necessary, one of them is wavelength converted
to avoid collision. In the optical buffering approach, one or more contending packets
are sent to fixed-length fiber delay lines, in order to reach the desired output
port only after a fixed amount of time, when no contention will occur. Finally,
in the deflection routing approach, contention is resolved by routing only one of
the contending packets along the desired link, while the others are forwarded
on paths which may be longer than the minimum-distance routing path.

Implementing optical buffering gives good network performance, but involves 
a great amount of hardware and electronic control. On the other hand, deflection 
routing is easier to implement than optical buffering, but network performance 
is reduced since a portion of network capacity is taken up by deflected packets. 

In the all-optical network proposed, in order to reduce complexity while aim- 
ing at attaining good network performance, the problem of contention is resolved 
combining a small amount of optical buffering with wavelength conversion and 
deflection routing. Our policy can be summarized as follows: 

1. When a contention occurs, the system first tries to transmit the conflicting 
packets on different wavelengths. 

2. If all of the wavelengths of the correct output link are busy at the time the
contention occurs, some packets are scheduled for later transmission
and are forwarded to the fiber delay lines.

3. Finally, if no suitable delay line is available at the time the contention
occurs for transmission on the correct output port, a conflicting packet can
be deflected to an output port other than the correct one.
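
The order of the three steps above can be summarized by a minimal Python sketch. The function name, the representation of resources as plain lists, and the final `DROP` outcome (for the case in which no deflection port is available either) are assumptions made for illustration only; they are not part of the proposed architecture.

```python
from enum import Enum


class Action(Enum):
    TRANSMIT = "transmit on a free wavelength of the correct output link"
    DELAY = "forward to a fiber delay line"
    DEFLECT = "deflect to another output port"
    DROP = "drop the packet"  # assumption: fallback when nothing is available


def resolve_contention(free_wavelengths, free_delay_lines, deflection_ports):
    """Pick a resource for a conflicting packet, trying the three steps in order.

    Each argument is a list of identifiers of the resources that are
    currently available (hypothetical representation).
    """
    if free_wavelengths:      # step 1: wavelength conversion
        return Action.TRANSMIT, free_wavelengths[0]
    if free_delay_lines:      # step 2: optical buffering
        return Action.DELAY, free_delay_lines[0]
    if deflection_ports:      # step 3: deflection routing
        return Action.DEFLECT, deflection_ports[0]
    return Action.DROP, None
```

For instance, with no free wavelength but one free delay line, `resolve_contention([], [1], [4])` selects the delay line before considering deflection.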

4 Node Architecture

The general architecture of a network node is shown in Fig. 3. It consists of
N incoming fibers with W wavelengths per fiber. The incoming fiber signals
are demultiplexed and 2^ wavelengths, from each input fiber, are then fed into
one of the W switching planes, which constitute the switching fabric's core.
Once signals have been switched in one of the second-stage parallel planes,
packets can reach every output port on one of the 2^ wavelengths that are
directed to each output fiber. This allows the use of wavelength conversion for
contention resolution, since 2^ packets can be transmitted simultaneously,
by each second-stage plane, on the same output link.

Fig. 3. Optical packet-switching node architecture

The detailed structure of one of the W parallel switching planes is shown
in Fig. 4. Each incoming link carries a single wavelength and the switching plane
consists of three main blocks: an input synchronization unit, needed because the node is
slotted and incoming packets must be aligned; a fiber delay lines unit, used to
store packets for contention resolution; and a switching matrix unit, which performs
the switching of signals.

These three blocks are all managed by an electronic control unit which carries 
out the following tasks: 

— optical packet header recovery and processing; 

— managing the synchronization unit in order to properly set the correct path 
through the synchronizer for each incoming packet; 

— managing the tunable wavelength converters in order to properly delay and 
route incoming packets. 



Optical Packet Switching for IP-over-WDM Transport Networks



Fig. 4. Detailed structure of one of the W/2^ parallel switching planes (legend:
2x2 switch; fiber delay lines; header receiver; tunable wavelength converter;
T: time slot duration)



We will now describe the second-stage switching planes mentioned above, de- 
tailing their implementation. 

4.1 Synchronization Unit 

This unit consists of a series of 2 x 2 optical switches interconnected by fiber delay
lines of different lengths. These are arranged in such a way that, depending on the
particular path set through the switches, the packet can be delayed by a variable
amount of time, ranging between $\Delta t_{min} = 0$ and $\Delta t_{max} = (1 - (1/2)^n) \times T$,
with a resolution of $T/2^n$, where T is the time slot duration and n the number
of delay lines.

The synchronization is achieved as follows: once the packet header has been 
recognized and packet delineation has been carried out, the packet start time 
is identified and the control electronics can calculate the necessary delay and 
configure the correct path of the packet through the synchronizer. 
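
As an illustration of how the control electronics could map a required delay onto a path through the synchronizer, the following Python sketch assumes that the n delay lines have lengths T/2, T/4, ..., T/2^n, each one either inserted or bypassed by its 2 x 2 switch. This binary arrangement is an assumption consistent with the maximum delay stated above; the function names are hypothetical.

```python
def synchronizer_settings(delay, T, n):
    """Return n booleans: entry k-1 is True if the line of length T / 2**k is inserted.

    `delay` is rounded to the nearest multiple of the resolution T / 2**n.
    """
    step = T / 2**n
    units = round(delay / step)
    if not 0 <= units <= 2**n - 1:
        raise ValueError("delay outside the range covered by the synchronizer")
    # Bit (n - k) of `units` tells whether stage k (delay T / 2**k) is used.
    return [bool((units >> (n - k)) & 1) for k in range(1, n + 1)]


def applied_delay(settings, T):
    """Total delay produced by a given switch configuration."""
    return sum(T / 2**k for k, used in enumerate(settings, start=1) if used)
```

With T = 8 time units and n = 3, a required delay of 5 units is obtained by inserting the first and third delay lines (4 + 1 units) and bypassing the second.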

Due to the fast reconfiguration speed needed, fast 2 x 2 switching devices,
such as 2 x 2 semiconductor optical amplifier (SOA) switches [7], which have a
switching time in the nanosecond range, must be used.

4.2 Fiber Delay Lines Unit 

After packet alignment has been carried out, the routing information carried by 
the packet header allows the control electronics to properly configure a set of 
tunable wavelength converters, in order to deliver each packet to the correct delay 
line to resolve contentions. Several devices are available to achieve
wavelength conversion [8], [9], [10].

Depending on the managing algorithm used by the control electronics, the fiber
delay lines stage can be used as an optical scheduler or as an optical
first-in-first-out (FIFO) buffer.

— Optical scheduling: this policy uses the delay lines in order to schedule the 
transmission of the maximum number of packets onto the correct output link. 
This implies that an optical packet P1, entering the node at time t1 from
the j-th WDM input channel, can be transmitted after an optical packet
P2, entering the node on the same input channel at time t2 > t1.
For example, suppose that packet P1, of duration l1 T, must be delayed by
d1 time slots in order to be transmitted onto the correct output port. This
packet will then leave the optical scheduler at time t1 + d1. So, if packet P2,
of duration l2 T, has to be delayed for d2 slots, it can be transmitted before
P1 if t2 + d2 + l2 < t1 + d1, since no collision will occur at the scheduler output.

— Optical FIFO buffering: in the optical FIFO buffer the order of the packets 
entering the fiber delay lines stage must be maintained. This leads to a 
simpler managing algorithm than the one used for the optical scheduling 
policy, yielding, however, a sub-optimal output channel utilization. In fact, 
suppose that optical packet P1, entering the FIFO buffer at time t1, must
be delayed for d1 time slots. This implies that packet P2, behind packet P1,
must be delayed by at least d1 time slots in order to maintain the order of
incoming packets. Due to this rule, if packet P2 has to be delayed for d2 < d1
slots in order to avoid conflict, its destination output port is idle for d1 - d2
time slots, even though there is a packet to transmit.
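
The difference between the two policies can be illustrated with a small Python sketch for the packets of a single input channel. It is a simplified model, not the simulator used for the results below: each packet is described only by its arrival slot and by the delay it needs to avoid conflict on its own output port, and the FIFO constraint is applied only while the previous packet is still inside the delay lines (an assumption). The function name and policy labels are hypothetical.

```python
def departures(packets, policy, max_delay):
    """Departure slot of each packet, or None if the packet is lost.

    packets:   list of (arrival_slot, needed_delay), in arrival order
    policy:    "os" (optical scheduling) or "fifo" (optical FIFO buffering)
    max_delay: maximum delay D offered by the delay lines, in slots
    """
    result = []
    prev_delay, prev_departure = 0, None
    for arrival, needed in packets:
        delay = needed
        still_buffered = prev_departure is not None and prev_departure > arrival
        if policy == "fifo" and still_buffered:
            # FIFO rule from the text: delay at least as long as the packet ahead.
            delay = max(needed, prev_delay)
        if delay > max_delay:
            result.append(None)  # no suitable delay line: packet lost
            continue
        result.append(arrival + delay)
        prev_delay, prev_departure = delay, arrival + delay
    return result


packets = [(0, 5), (1, 2)]           # P1 needs d1 = 5 slots, P2 needs d2 = 2 slots
print(departures(packets, "os", 8))
print(departures(packets, "fifo", 8))
```

Under optical scheduling P2 leaves at slot 3, before P1; under FIFO buffering it is held for d1 = 5 slots and leaves at slot 6, so its output port stays idle for d1 - d2 = 3 slots.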

4.3 Switching Matrix Unit 

Once packets have crossed the fiber delay lines unit, they enter the switching 
matrix stage in order to be routed to the desired output port. This is achieved 
using a set of tunable wavelength converters combined with an arrayed waveguide 
grating (AWG) wavelength router [11].

This device consists of two slab star couplers, interconnected by an array of
waveguides. Each grating waveguide has a precise path difference with respect
to its neighbours, $\Delta X$, and is characterized by a refractive index of value $n_w$.

Once a signal enters the AWG from an incoming fiber, the input star coupler
divides the power among all waveguides in the grating array. As a consequence
of the difference in guide lengths, light travelling through each waveguide
emerges with a different phase delay given by:

$$\Delta\Phi = 2\pi n_w \times \frac{\Delta X}{\lambda} \qquad (1)$$

where $\lambda$ is the central wavelength of the incoming signal. As all the beams emerge from
the grating array they interfere constructively at the focal point in the output
star coupler, so that an interference maximum is coupled into a
particular output fiber, depending only on the input signal central wavelength.
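
As a numerical illustration of the phase delay expression, the following Python sketch evaluates it for two nearby wavelengths. The refractive index, path difference and wavelengths are hypothetical values chosen only to show the order of magnitude; they are not taken from this work.

```python
import math


def phase_delay(n_w, delta_x, wavelength):
    """Phase delay between adjacent grating waveguides: 2*pi*n_w*delta_x/lambda."""
    return 2 * math.pi * n_w * delta_x / wavelength


# Hypothetical example values (all lengths in meters).
n_w = 1.45
delta_x = 50e-6
for wavelength in (1550.0e-9, 1550.8e-9):
    print(wavelength, phase_delay(n_w, delta_x, wavelength))
```

A longer wavelength gives a slightly smaller phase delay between adjacent waveguides, which is what shifts the interference maximum to a different output fiber.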

Figure 5 shows the mechanism described above. Two signals of wavelength
$\lambda_0$ and $\lambda_3$ entering an 8 x 8 AWG, from input fibers number 6 and number 1
respectively, are correctly switched onto the output fibers number 0 and number
3, the wavelength of the signals being the only routing information needed to achieve
the required permutation.

The AWG is used as it gives better performance than a normal space switch 
interconnection network, as far as insertion losses are concerned. This is due 
to the high insertion losses of all the high-speed all-optical switching fabrics 
available at the moment, that could be used to build a space switch intercon- 
nection network. Moreover AWG routers are strictly non-blocking and offer high 
wavelength selectivity. 

After crossing the three stages previously described, packets undergo a final 
wavelength conversion, to avoid collisions at the output multiplexers, where W 
WDM channels are multiplexed on each output link. 



5 Simulation Results 

In this section, we present some simulation results of the operation of one of
the W/2^ parallel switching planes, whose structure is shown in Fig. 4.

These results have been obtained assuming that the node receives its input
traffic directly from N edge systems. The buffer capacity of the edge systems is
assumed to be large enough to make packet loss negligible. Each WDM channel is
assumed to have a dedicated buffer in the edge system.

The packet arrival process has been modeled as a Poisson process, with packet
interarrival times having a negative exponential distribution. As the node
operation is slotted, the packet duration was always assumed to be a multiple of the
time slot duration T, which is equal to the amount of time needed to transmit
an optical packet, with a 40-byte payload, from an input WDM channel
to an output WDM channel.

As far as packet length is concerned, the following probability distributions 
were considered: 

Fig. 6. Packet duration probability distributions: empirical distribution (a), uniform
distribution (b), constant duration (c)

1. Empirical distribution. Based on real measurements on IP traffic [12], [13],
we assumed the following probability distribution for the packet length L:

$$\begin{cases} p_0 = P\{L = 40 \text{ bytes}\} = 0.5833 \\ p_1 = P\{L = 576 \text{ bytes}\} = 0.3333 \\ p_2 = P\{L = 1500 \text{ bytes}\} = 0.0834 \end{cases} \qquad (2)$$

In this model, packets have an average length of 341 bytes. Since a 40-byte
packet is transmitted in one time slot of duration T, the average
duration of an optical packet is approximately 9T. Moreover, $p_0$, $p_1$ and
$p_2$ represent the probability that the packet duration is T, 15T and 38T
respectively (see Fig. 6(a)).

2. Uniform distribution. For comparison with the empirical model described
above, we have modeled the optical packet length as a stochastic
variable, uniformly distributed between 40 bytes (duration T) and 680 bytes
(duration 17T). In this model too, packets have an average duration of 9T (see
Fig. 6(b)).

3. Constant length. We have also investigated the behaviour of the system when
packets have a constant duration of 9T (see Fig. 6(c)).
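
The average durations quoted for the three models can be checked with a short Python sketch. It assumes that a packet occupies a whole number of slots, obtained by rounding its length up to a multiple of 40 bytes, and that the uniform model is discrete over 1 to 17 slots; both are assumptions made for this illustration, and all names are hypothetical.

```python
import math

SLOT_BYTES = 40  # payload carried in one time slot T

# Empirical model: packet length in bytes -> probability
EMPIRICAL_BYTES = {40: 0.5833, 576: 0.3333, 1500: 0.0834}


def slots(length_bytes):
    """Number of time slots needed for a packet of the given length."""
    return math.ceil(length_bytes / SLOT_BYTES)


def mean(distribution):
    """Mean of a discrete distribution given as {value: probability}."""
    return sum(value * p for value, p in distribution.items())


empirical_slots = {slots(b): p for b, p in EMPIRICAL_BYTES.items()}
uniform_slots = {k: 1 / 17 for k in range(1, 18)}   # T ... 17T
constant_slots = {9: 1.0}

print(sorted(empirical_slots))        # slot counts: 1, 15 and 38
print(mean(EMPIRICAL_BYTES))          # about 340 bytes
print(mean(empirical_slots))          # about 8.75 slots
print(mean(uniform_slots), mean(constant_slots))
```

The three models therefore have nearly the same average duration (about 9T) while differing strongly in their maximum packet duration (38T, 17T and 9T), which is the quantity that matters in the comparison that follows.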

These simulations were carried out assuming that no deflection routing algorithm
is implemented. Under this assumption, a packet is considered lost if
it cannot be delayed by a suitable amount of time in order to transmit it onto the
correct output port. Figures 7 through 12 show the packet loss probability at
different traffic loads per wavelength, for different values of the maximum delay
attainable by the fiber delay lines unit, D = iT (i = 0, 1, 2, ...).

Figures 7, 9 and 11 report the simulation results for the optical FIFO buffering
(OFB) policy in the fiber delay lines unit, while Figs. 8, 10 and 12 report
the results for the optical scheduling (OS) policy.

It can be seen that, regardless of packet length distribution, the OS policy 
yields a better performance than the OFB policy, with an increasing improve- 
ment as D grows. 

Figures 13, 14 and 15 show the values of the ratio

$$\eta = \frac{\Pi_{OS}}{\Pi_{OFB}} \qquad (3)$$

for different values of the maximum achievable delay, D, at different traffic loads
per wavelength, where $\Pi_{OS}$ and $\Pi_{OFB}$ are the packet loss probabilities for the
optical scheduling and optical FIFO buffering policies, respectively. It can be
seen that no significant improvement is obtained when D is 0,
1, 2 or 4.




Fig. 7. Packet loss probability for the empirical distribution, with FIFO policy, at
different loads per wavelength (horizontal axis: average load per wavelength)



It can also be seen that the improvement is more evident for the uniform
distribution, and even more so for constant length packets. This happens because the
system performance is influenced not only by the maximum achievable delay, D,
but also by the maximum optical packet length.

In fact, what really influences the system performance is the Lm/D ratio,
where Lm is the maximum packet duration and D the maximum delay attainable.

Fig. 8. Packet loss probability for the empirical distribution, with scheduling policy,
at different loads per wavelength (horizontal axis: average load per wavelength)

Fig. 9. Packet loss probability for the uniform distribution, with FIFO policy, at
different loads per wavelength (horizontal axis: average load per wavelength)



Thus, when D is much smaller than Lm, the influence of the fiber delay
lines managing discipline is negligible. When D becomes much larger than Lm,
on the other hand, the efficiency improvement of the OS policy becomes more and more
evident.

It is now interesting to show how the efficiency improvement yielded
by the OS policy varies with the optical packet length. Figure 16 plots this
variation at different loads per wavelength, for three values of the packet length,
for constant length packets, when the maximum delay attainable is D = 16.
It can be seen that, for Lm = 8, that is Lm = D/2, a significant efficiency
improvement is obtained, while for Lm = 32, that is Lm = 2D, the optical
scheduling and optical FIFO buffering policies give almost the same performance.

Fig. 10. Packet loss probability for the uniform distribution, with scheduling policy,
at different loads per wavelength (horizontal axis: average load per wavelength)

Fig. 11. Packet loss probability for constant packet length, with FIFO policy, at
different loads per wavelength

Fig. 12. Packet loss probability for constant packet length, with scheduling policy, at
different loads per wavelength

Fig. 13. Empirical distribution: values of $\eta$ at different loads per wavelength and for
different values of the maximum delay attainable D



6 Conclusions and Topics for Further Research 

In this work, we proposed an architecture for optical packet-switched transport 
networks. The structure of the optical switching nodes was detailed and the basic 
building blocks were described. Some simulation results were also presented, 
showing a comparison between two different managing policies for the fiber delay 
lines stage: optical scheduling and optical FIFO buffering. 

Fig. 14. Uniform distribution: values of $\eta$ at different loads per wavelength and for
different values of the maximum delay attainable D (horizontal axis: average load
per wavelength)

Fig. 15. Constant packet length: values of $\eta$ at different loads per wavelength and for
different values of the maximum delay attainable D



It was shown that, for D ≪ Lm, OS and OFB give almost the same performance.
For D ≫ Lm, on the other hand, the optical scheduling policy yields
better performance than the optical FIFO buffering policy, because the output
links are exploited more efficiently.

Many issues will have to be addressed in the future, such as the detailed
study of the improvement attainable with the optical scheduling policy depending
on the optical packet length. Moreover, the behaviour of an optical transport
network as a whole will have to be investigated, since only single-node operation
was simulated for this work. Another interesting issue is the implementation of a
suitable deflection routing algorithm in order to improve network performance,
varying the optical network topology.

Fig. 16. Constant packet length: values of $\eta$ at different loads per wavelength, for
D = 16, Lm = 8, Lm = 16 and Lm = 32



References 

1. Hunter, D.K., Andonovic, I.: Approaches to Optical Internet Packet Switching. 
IEEE Commun. Mag. (Sep. 2000) 116-122 

2. Yao, S., Mukherjee, B., Dixit, S.: Advances in Photonic Packet Switching: An
Overview. IEEE Commun. Mag. (Feb. 2000) 84-94

3. Yoo, M., Qiao, C.: Just-Enough-Time (JET): A High Speed Protocol for Bursty
Traffic in Optical Networks. Proc. IEEE/LEOS Tech. for a Global Info. Infrastructure
(Aug. 1997) 26-27

4. Qiao, C.: Labeled Optical Burst Switching for IP-over-WDM Integration. IEEE
Commun. Mag. (Sep. 2000) 104-114

5. Hunter, D.K. et al.: WASPNET: A Wavelength Switched Packet Network. IEEE
Commun. Mag. (Mar. 1999) 120-129

6. Renaud, M., Masetti, F., Guillemot, C., Bostica, B.: Network and System Concepts 
for Optical Packet Switching. IEEE Commun. Mag. (Apr. 1997) 96-102 

7. Dorgeuille, F., Mersali, B., Feuillade, M., Sainson, S., Slempkes, S., Foucher, M.: 
Novel Approach for Simple Fabrication of High-Performance InP-Switch Matrix 
Based on Laser-Amplifier Gates. IEEE Photon. Technol. Lett., Vol. 8. (1996) 1178-1180

8. Stephens, M.F.C. et al.: Low Input Power Wavelength Conversion at 10 Gb/s Using 
an Integrated Amplifier/DFB Laser and Subsequent Transmission over 375 km of 
Fibre. IEEE Photon. Technol. Lett., Vol. 10. (1998) 878-880 







9. Owen, M. et al.: All-Optical 1x4 Network Switching and Simultaneous Wavelength
Conversion Using an Integrated Multi-Wavelength Laser. Proc. ECOC '98, Madrid,
Spain, (1998)

10. Tzanakaki, A. et al.: Penalty-Free Wavelength Conversion Using Cross-Gain
Modulation in Semiconductor Laser Amplifiers with no Output Filter. Elec. Lett., Vol.
33. (1997) 1554-1556

11. Parker, C., Walker, S.D.: Design of Arrayed-Waveguide Gratings Using Hybrid
Fourier-Fresnel Transform Techniques. IEEE J. Selec. Topics Quant. Electron.,
Vol. 5. (1999) 1379-1384

12. Thompson, K., Miller, G.J., Wilder, R.: Wide-area Internet Traffic Patterns and
Characteristics. IEEE Network, Vol. 11. (1997) 10-23

13. Generating the Internet Traffic Mix Using a Multi-Modal Length Generator. 
Spirent Communications white paper, http://www.netcomsystems.com 




MPLS over Optical Packet Switching 



F. Callegati, W. Cerroni, G. Corazza, and C. Raffaelli 

D.E.I.S. - University of Bologna, Viale Risorgimento 2 - 40136 Bologna - Italy 
{fcallegati, gcorazza, wcerroni, craffaelli}@deis.unibo.it



Abstract. This paper deals with the problem of connection-to-wavelength
assignment in an MPLS optical packet-switched network with DWDM links.
The need for dynamic allocation of connections to wavelengths in order
to avoid congestion is outlined, and dynamic wavelength assignment algorithms are
proposed. The main results show the effectiveness of dynamic assignment
compared with static assignment, and show that the connection configuration can be
exploited for performance enhancement.



1. Introduction 

The explosive growth of the Internet demands larger and larger bandwidth, in
particular in the core part of the network. The recent introduction and rapid growth of
Dense Wavelength Division Multiplexing (DWDM) technology provides a
platform to exploit the huge capacity of optical fibers. At the same time, the
introduction of the MPLS paradigm in TCP/IP-based networks promises effective
network management and traffic engineering in the future.

The integration of MPLS with all-optical networks is a widely discussed issue,
and proposals such as, for instance, MPλS are emerging. Assuming the availability of
all-optical packet switching technology, this paper focuses on the problem of
integrating MPLS and DWDM.

Optical packet switching has been the subject of several international research
projects in the last decade (see for instance [1][2][3]) and its feasibility has been
proved. Most of this work is related to the case of fixed-length synchronous packets,
which allows an easier switching matrix design but does not match IP well. For this
reason, more recently the case of asynchronous variable-length packets has also been studied
[4][5], showing that several new problems arise but also that acceptable performance
may be achieved. The work presented here considers the case of MPLS traffic and
therefore assumes variable-length packets.

In the case of all optical packet switching congestion resolution may be achieved in 
the time domain by means of queuing and in the wavelength domain by means of 
suitable wavelength multiplexing. 

Queuing is achieved by delay lines (coils of fiber) that are used to delay packets
as if they were placed for some time into a memory and then retrieved for
transmission. In particular, buffering variable-length packets with delay lines is a very
critical task and the performance depends strongly on the so-called buffer time scale
[6]. This is the unit of delay introduced by the delay lines that, depending on their
length, delay a packet by a multiple of this unit. In particular, the ratio between the delay
unit and the average packet length is a key parameter, and for this reason most of the
results presented in the following are plotted against this quantity.
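
The effect of the delay unit can be illustrated with a minimal Python sketch. It assumes a buffer whose delay lines provide delays 0, D, 2D, ..., kD, and a packet that must wait at least a given time before its wavelength becomes free; the function name and parameters are hypothetical and the model is deliberately simplified.

```python
import math


def assign_delay(wait, unit, max_multiple):
    """Smallest available delay (a multiple of `unit`) not shorter than `wait`.

    Returns None if even the longest delay line, max_multiple * unit,
    is too short, in which case the packet is lost.
    """
    k = math.ceil(max(wait, 0.0) / unit)
    if k > max_multiple:
        return None
    return k * unit


delay = assign_delay(2.5, 1.0, 4)
print(delay, delay - 2.5)   # granted delay and the time the wavelength stays idle
```

Because delays come only in multiples of the unit, a packet is usually delayed longer than strictly necessary; the resulting idle time on the wavelength is one reason why the ratio between the delay unit and the average packet length matters.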

Wavelength multiplexing depends on the wavelength allocation strategy in the
network architecture. There are two possible alternatives [7] [8]:

• Wavelength Circuit (WC) in which the elementary path within the network for a 
given packet flow is designated by the wavelength, therefore packets belonging to 
the same flow will all be transmitted on the same wavelength; 

• Wavelength Packet (WP) in which the wavelengths can be used as a shared 
resource, the traffic load is spread over the whole wavelength set on an availability 
basis and packets belonging to the same flow can be spread over more 
wavelengths. 

In the case of a purely connectionless network such as an IP network it has been 
proven [9] [10] that the WP approach is by far superior and may largely improve the 
overall performance. 

This work extends these ideas to the case of a connection-oriented network, such
as an MPLS network, and provides some insight into the problems related to the
application of dynamic wavelength allocation to a connection-oriented environment.

The paper is structured as follows. In section 2 a brief overview of the problem of 
integrating MPLS with optical packet switching is presented with particular focus on 
the issues of integrating management of LSPs and wavelength dimension. In section 3 
a formal definition of the problem of LSP grouping on the input/output wavelengths is 
presented providing formulas to quantitatively evaluate such grouping. In section 4 
performance results are presented for a simple wavelength allocation algorithm and in 
section 5 improved wavelength management algorithms are presented that take into 
account the grouping factor. Some conclusions are presented in section 6. 



2. Multiplexing MPLS over DWDM 

MPLS is a connection-oriented protocol (in contrast with regular IP) setting up 
unidirectional Label Switched Paths (LSPs) identified by an additional label added to 
the IP datagrams [11]. With MPLS the network layer functions are partitioned into 
two basic components: control and forwarding. The control component uses standard 
routing protocols to exchange information with other routers, and based on routing 
algorithms, it builds up and maintains a forwarding table. The forwarding component 
has the task to process incoming packets, examine the headers and make forwarding 
decisions, based on the forwarding table. The entire set of packets that a router can 
forward is split into a finite number of subsets, called Forwarding Equivalence 
Classes (FECs). Packets belonging to the same FEC are, from a forwarding point of
view, indistinguishable and are forwarded in a connection-oriented fashion from 
source to destination along an LSP. The MPLS approach has important consequences 
with respect to e.g. traffic engineering in the IP-layer: one can set up explicit routes 
through the network to optimize the usage of available network resources, or to create 
distinct paths for different classes of quality of service. 

In present proposals for optical networking, such as MPλS, the LSPs are mapped
onto wavelengths, in order to realize end-to-end wavelength-switched paths. These
may not be the most efficient solutions for multiplexing, since the incoming flow of
information is bursty by nature. To achieve maximum flexibility in terms of
bandwidth allocation and sharing of capacity, the challenge is to combine the
advantages of DWDM with emerging all-optical packet switching capabilities to yield
optical routers able to support label switching of packets.

Optical routers require a further partitioning of the forwarding component into
a forwarding algorithm, that is, the routing table look-up that determines the datagram's next
hop destination, and a switching function, that is, the physical action of transferring a
datagram to the output interface chosen by the forwarding algorithm. The
main goal is to limit electro-optical conversion to the minimum, to achieve better
interfacing with optical WDM transmission systems:

• the header is converted from optical to electrical and the execution of the 
forwarding algorithm is performed in electronics; 

• the datagram payload is optically switched without conversion to the electrical 
domain. 

This approach suits the MPLS paradigm very well and exploits the best of both
electronic and optical technology. Electronics is used in the routing component and
forwarding algorithm, while optics is used in switching and transmission, where high
data rates are required.

This paper focuses on this scenario, assuming the availability of an all-optical 
switching matrix able to switch variable length packets, for instance like the one 
described in [5]. The problem of congestion due to temporary overloads is addressed 
by means of queuing in a fiber delay line buffer and by means of wavelength 
multiplexing in the assumption of a Wavelength Packet use of the wavelength 
resource. A very limited number of fiber delay lines are used to solve congestion in 
the time domain. The fibers are shared among all input/output pairs but, due to the 
need for delay allocation before feeding a packet into the buffer, the buffering 
architecture is equivalent to pure output queuing. 

Should a packet be forwarded to a wavelength that is busy, it may be queued or
re-routed towards a less congested wavelength on the same fiber, thus keeping the same
network path (same fiber) and exploiting WP multiplexing. This scheme has
proved effective in the case of a connectionless network. In particular it has been
shown that, by means of intelligent wavelength selection algorithms aiming at
minimizing queue occupancy, a reduction in packet loss probability of several orders of
magnitude can be achieved [9] [10]. These results are obtained under the assumption of
complete freedom in output wavelength assignment to incoming packets.
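
A wavelength selection rule of this kind can be sketched in a few lines of Python. The sketch below simply picks the wavelength of the output fiber that becomes free first, which is one plausible way of keeping queue occupancy low; it is a hypothetical illustration and not the specific algorithms of [9] [10]. The function name and its parameters are assumptions.

```python
def select_wavelength(free_at, now, max_wait):
    """Choose the output wavelength with the shortest queue.

    free_at:  free_at[w] is the time at which wavelength w becomes free
    now:      arrival time of the packet
    max_wait: longest delay the fiber delay line buffer can provide
    Returns (wavelength index, waiting time), or None if the packet is lost.
    """
    best = min(range(len(free_at)), key=lambda w: free_at[w])
    wait = max(0.0, free_at[best] - now)
    if wait > max_wait:
        return None
    return best, wait


print(select_wavelength([5.0, 2.0, 7.0], now=3.0, max_wait=4.0))
```

In the example the second wavelength is already free when the packet arrives, so the packet is sent on it without any queuing.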

Unfortunately in a connection-oriented network scenario, such as MPLS, the 
wavelength hopping of packets belonging to the same LSP will cause out of order 
arrivals and updates of the forwarding table. The former issue means more complex 
interfaces at the edge of the optical network for re-sequencing and the latter a possible 
overload for the control function of the optical packet switch. Therefore a trade-off 
has to be found between wavelength hopping for congestion resolution and 
forwarding of packets of the same LSP on the same wavelength as long as possible. 

The reference configuration is an optical packet switch with N x N input/output
fibers and W wavelengths per fiber. We use the following notation:

• n: input fiber index (n = 1...N);

• w: wavelength index on a fiber (w = 1...W).

Therefore a wavelength can be identified by the pair of indexes (n,w). We
assume that wavelength (n,w) carries $M_{n,w}$ LSPs and that we can use the index

• m: LSP index for a given wavelength and fiber (m = 1...$M_{n,w}$).

Each LSP can be identified by the triplet (n,w,m), where m is the LSP index on
wavelength w of fiber n.

Under the assumption of uniform traffic distribution, each LSP provides on average
the same traffic load, and therefore the same number M of LSPs is carried by each
input and output wavelength. Nevertheless, congestion is not uniformly distributed
and depends on the forwarding setup of the LSPs. This concept can be understood
intuitively as follows: let us assume that the M LSPs incoming on a given wavelength
are routed to the same output fiber n and are all forwarded to the same
wavelength w. Under the assumption of uniform traffic, wavelength w will not carry other
LSPs. As a result, wavelength w will never experience any congestion and
its buffer will be totally unused: the arriving packets cannot overlap, since they
come from the same input wavelength, and no queuing will arise. Therefore the
buffering resource devoted to wavelength w is useless in this forwarding
configuration.

By means of this trivial example we understand that the congestion phenomena are 
related to the forwarding table of the LSPs and in particular to the per wavelength 
grouping of LSPs routed to the same fiber. The previous example for instance 
suggests that, if possible, it is worthwhile to group LSPs incoming from the same 
wavelength and also that the unused buffering resource could be temporarily used to 
relieve the congestion undergoing at some other wavelength (queuing may still arise 
because of random fluctuations in the packet arrival process when no grouping is 
possible). 

The basic idea is to modify the MPLS forwarding table and shift temporarily one 
or more LSPs from the overloaded wavelength to another where some buffering 
resource is available. The choice of the new wavelength among those available is a 
matter of optimization and has to take into account the grouping of LSPs in some 
form. The first step towards the definition of an algorithm to perform this task is to 
define a quantitative measure of the grouping of LSPs. A proposal is presented in the 
following section, followed by numerical examples and by an algorithm that aims at 
minimizing packet loss by exploiting the connection oriented nature of the input 
traffic. 



3. The Grouping Index 

In this section we define an index that is used to take into account and compare the 
different grouping configurations of the LSPs on the input/output wavelengths. Let us 
refer to a group of LSPs as the set of LSPs incoming on the generic wavelength (n,w) 
that are addressed to the same output fiber n'. For the generic LSP (n,w,m) it is
possible to define a parameter $g_{n,w,m}$ that measures whether the LSP belongs to a group
and how numerous that group is, defined as:

$$ g_{n,w,m} = \frac{S_{n,w,n'}}{M_{n,w}} \qquad (1) $$

Here $S_{n,w,n'}$ is the cardinality (that is, the number of LSPs belonging to the group) of the
group to which the LSP belongs, and $M_{n,w}$ is the number of LSPs on wavelength (n,w).
If the LSP does not belong to a group
because there are no other LSPs on the same wavelength addressed to the same
output, $S_{n,w,n'} = 0$ and consequently $g_{n,w,m} = 0$. The parameter takes values in the
range between 0 and 1, being 1 when all the LSPs on the wavelength belong to the
same group, with $g_{n,w,m} = 1$ for all $m$: $1 \le m \le M_{n,w}$.

Let us focus on the input ports of the switch and define the input grouping index G.
On the basis of the given definitions the grouping index for input wavelength
(n,w) is defined as:

$$ G_{n,w} = \frac{1}{M_{n,w}} \sum_{m=1}^{M_{n,w}} g_{n,w,m} \qquad (2) $$

and the average input grouping index G for the switch is:

$$ G = \frac{1}{NW} \sum_{n=1}^{N} \sum_{w=1}^{W} G_{n,w} \qquad (3) $$

In particular $G_{n,w}$ is equal to 1 only when all the LSPs on (n,w) form a single group and
$G_{n,w} = 0$ when no LSP belongs to any group. In all other cases $0 < G_{n,w} < 1$.
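
As an illustration of the definitions above, the following minimal Python sketch computes the per-LSP parameter, the per-wavelength index and the average input index from a forwarding table. The function names and the table layout (a dictionary mapping each input wavelength (n,w) to the list of output fibers of its LSPs) are assumptions made for this example and are not part of the original work; every wavelength is assumed to carry at least one LSP.

```python
from collections import Counter


def lsp_parameters(out_fibers):
    """Per-LSP parameter g for one input wavelength.

    out_fibers[m] is the output fiber of the m-th LSP on the wavelength
    (hypothetical representation of the forwarding table).
    """
    m_total = len(out_fibers)
    sizes = Counter(out_fibers)
    # An LSP that is alone towards its output fiber is not in a group (S = 0).
    return [sizes[f] / m_total if sizes[f] > 1 else 0.0 for f in out_fibers]


def wavelength_index(out_fibers):
    """Grouping index of one wavelength: the mean of g over its LSPs."""
    g = lsp_parameters(out_fibers)
    return sum(g) / len(g)


def switch_index(table):
    """Average grouping index over all wavelengths in the table."""
    return sum(wavelength_index(v) for v in table.values()) / len(table)
```

For instance, with three LSPs per wavelength, `wavelength_index([0, 0, 1])` evaluates the case of one couple plus one single LSP.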

The same definition can be applied to the output ports of the switch. Let us use the
indices i = 1...N' and j = 1...W' for the output wavelengths (i,j), meaning that i is the output
fiber and j is the wavelength on that fiber. In this case the output grouping index G' is
defined in a similar way as follows. We first focus on the single wavelength (i,j):

$$ G'_{ij} = \frac{1}{M'_{ij}} \sum_{m=1}^{M'_{ij}} g'_{ij,m} \qquad (4) $$

where $M'_{ij}$ is the number of LSPs on the wavelength and $g'_{ij,m}$ is the counterpart of
the per-LSP index defined in equation (1), referred to the output groups. An output
group is the set of LSPs assigned to the same output wavelength of the same output
fiber coming from the same input wavelength of the same input fiber. Consequently
the output group definition takes into account the wavelength assignment performed by
the switch.

The average output grouping index G' is then given by

$$ G' = \frac{1}{W'N'} \sum_{i=1}^{N'} \sum_{j=1}^{W'} G'_{ij} \qquad (5) $$



While G always depends only on the LSP routing, G' depends only on the routing in the case of
static assignment; when dynamic wavelength assignment is performed,
it also depends on the wavelength assignment algorithm. The upper bound of G' is in
any case G, which corresponds to the optimal wavelength assignment because it keeps
the same grouping on the output as on the input.



4. Dynamic Wavelength Assignment

In normal switch operation an output wavelength is said to be congested when
its queue is full. To avoid unbalanced usage of the wavelengths and to maximize the
exploitation of wavelength multiplexing, the following algorithm, called
Round-Robin Wavelength Selection (RRWS), is defined:

• an output wavelength is randomly assigned to each LSP at call set up; 

• any time a packet arrives at queue (i,j) and finds it congested, another wavelength is
searched in a round-robin fashion starting from (i,j+1);

• when a non-congested wavelength is found, the LSP is assigned to that wavelength
by updating the forwarding table so that the packets of the LSP will be forwarded
to the new wavelength.
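
The three steps above can be sketched as follows. This is only a minimal illustration under stated assumptions: wavelengths on a fiber are numbered 0..W-1, the search wraps around on the same output fiber, `congested` is a hypothetical predicate telling whether queue (i,k) is full, and a packet that finds every queue of the fiber full is treated as lost. None of these names come from the original paper.

```python
def rrws_select(i, j, congested, n_wavelengths):
    """Round-robin search on fiber i, starting from wavelength j + 1.

    Returns the first non-congested wavelength, or None if all are full.
    """
    for step in range(1, n_wavelengths):
        k = (j + step) % n_wavelengths
        if not congested(i, k):
            return k
    return None


def on_packet_arrival(lsp, forwarding, congested, n_wavelengths):
    """Return the queue (i, k) used by the packet, or None if it is lost.

    forwarding maps an LSP identifier to its current output queue (i, j);
    it is updated in place when the LSP hops to a new wavelength.
    """
    i, j = forwarding[lsp]
    if not congested(i, j):
        return (i, j)
    k = rrws_select(i, j, congested, n_wavelengths)
    if k is None:
        return None  # assumption: no free queue on the fiber, packet dropped
    forwarding[lsp] = (i, k)  # later packets of the LSP follow the new entry
    return (i, k)
```

Because the forwarding table is updated, the hop applies to the whole LSP and not only to the packet that triggered it, which is what distinguishes this scheme from the packet-by-packet connectionless case.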

This algorithm is very simple and requires a very limited amount of processing. To 
prove its effectiveness we have compared the packet loss probability for the RRWS 
algorithm with: 

1. a purely connectionless environment, like the one considered in [9], where the 
wavelength is changed packet by packet, packets being considered as IP datagrams 
not associated to specific LSPs; 

2. an MPLS environment without dynamic wavelength assignment multiplexing 
(static case), meaning that the output wavelength per LSP is chosen at call set up 
and never changed regardless of the congestion state of the assigned wavelength. 

The simulation set up is an optical switching matrix with N = 4 fibers each carrying
W = 16 wavelengths, simulated by means of an ad-hoc event driven simulator. The 
number of MPLS connections per wavelength is M = 3 and a fiber delay buffer with 
B = 16 delay lines is placed in front of any output wavelength. The input traffic is 
random, with average packet length of 500 bytes and load equal to 0.8 per 
wavelength, and is uniformly distributed. 

In this simulation, for 1 set out of 16 the 3 LSPs are forwarded to the same output
fiber and grouped on the same wavelength; for 9 sets out of 16, 2 LSPs out of
3 are forwarded to the same fiber and grouped on the same wavelength; and,
finally, for 6 sets out of 16 the 3 LSPs are forwarded to different output
fibers and cannot be grouped. This configuration results from the evaluation of the
average assignment of the LSPs to the outputs. The assumption of a uniform traffic
pattern leads to 1/N for the probability that a given LSP is addressed to a given output
fiber. The probability $P_1$ that all the 3 LSPs on the same input wavelength are directed
to the same output is also the probability that, given the output of the first LSP, the
same value is extracted twice more, which leads to $P_1 = 1/N^2$. On the other
hand, when the 3 LSPs are forwarded to 3 different outputs, given the first value the
other ones can be chosen among the remaining N-1 and N-2 respectively. This
leads to a probability $P_3 = (N-1)(N-2)/N^2$. The last case of only two LSPs
directed to the same output has probability $P_2 = 1 - P_1 - P_3$. With N = 4 the
probabilities are $P_1 = 1/16$, $P_2 = 9/16$ and $P_3 = 6/16$. Assuming the previous routing
configuration allows us to provide average results with a single simulation.
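
The arithmetic of this paragraph can be checked with a few lines of Python. The function name is chosen for this example only; the formulas are those stated in the text for M = 3 LSPs per wavelength and uniformly distributed output fibers.

```python
from fractions import Fraction


def grouping_probabilities(n_fibers):
    """Return (P1, P2, P3) for 3 LSPs spread uniformly over n_fibers outputs."""
    n = Fraction(n_fibers)
    p1 = 1 / n**2                   # all three LSPs to the same fiber
    p3 = (n - 1) * (n - 2) / n**2   # three different fibers
    p2 = 1 - p1 - p3                # exactly two LSPs to the same fiber
    return p1, p2, p3
```

Exact fractions are used so that the result can be compared directly with the 1, 9 and 6 sets out of 16 quoted for N = 4.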

In this average case the input grouping index G can be evaluated as follows. For
each triplet (n,w,m) the corresponding LSP may have parameter

$$ g_{n,w,m} = \begin{cases} 0 & \text{if the LSP does not belong to any group} \\ 2/3 & \text{if the LSP belongs to a couple} \\ 1 & \text{if the LSP belongs to a triplet} \end{cases} $$

Because each input wavelength (n,w) carries at most one group, for N = 4 and W = 16
it follows

$$ G_{n,w} = \begin{cases} 0 & \text{if } (n,w) \text{ carries only single LSPs, i.e. for 24 wavelengths out of 64} \\ 4/9 & \text{if } (n,w) \text{ carries a couple and a single LSP, i.e. for 36 wavelengths out of 64} \\ 1 & \text{if } (n,w) \text{ carries a triplet, i.e. for 4 wavelengths out of 64} \end{cases} $$

The resulting input grouping index is

$$ G = \frac{1}{64}\left(4 \times 1 + 36 \times \frac{4}{9} + 24 \times 0\right) = 0.3125. $$
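
The same calculation can be repeated for other mixes of triplets and couples. The sketch below is an illustration only: the function name and its parameters are assumptions, and it presumes, as in the text, that each wavelength carries at most one group. Counts are given per set of 16 wavelengths, which is how Table 1 reports them.

```python
def average_grouping_index(triplets, couples, sets=16, m=3):
    """Average input grouping index for a given number of triplets and couples.

    triplets, couples: wavelengths (out of `sets`) carrying a full group
    or a couple plus single LSPs; all remaining wavelengths contribute 0.
    """
    index_triplet = 1.0
    # A couple: two LSPs with g = 2/m, the others with g = 0, averaged over m.
    index_couple = 2 * (2 / m) / m
    return (triplets * index_triplet + couples * index_couple) / sets
```

With 1 triplet and 9 couples out of 16 this gives the value 0.3125 derived above, and the other rows of the G column of Table 1 follow in the same way.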



In figure 1 the packet loss probability is plotted as a function of D that is the buffer 
time unit normalized to the average packet length. As already outlined in the 
introduction this is a critical parameter in the case of variable length packets, because 
there is an optimal choice of D due to the trade off between a small buffer (D small) 
and an inefficient utilization of the buffer itself (D large) [6]. 




Fig. 1. Comparison of the RRWS with the static and connectionless case for N=4, W=16, M=3, 
G=0.3125, B = 16, average load per wavelength 0.8. 

The figure shows that the RRWS algorithm significantly improves the performance
with respect to the static case, coming very close to the connectionless case. This
behavior is explained by the connection-oriented nature of the simulated
traffic. Because of the grouping, some wavelengths experience less congestion than
in the purely connectionless case or, in 1 case out of 16, even no congestion at all. The
RRWS algorithm exploits such situations and, when congestion arises on one
wavelength, it shifts the overload traffic to a non-congested wavelength. Because of
the grouping it is very likely that a non-congested wavelength is available to relieve a
temporary overload.

To support this intuitive statement, table 1 reports the packet loss probability
for different configurations of the LSP forwarding table. In particular we have varied
the number of triplets and couples of LSPs that can be grouped on the outputs. The
table shows that the more triplets there are, the lower the overall packet loss probability. It
also shows that the packet loss probability decreases when the grouping index G
increases, which is the quantitative description of the intuitive concepts presented
above.



Table 1. Packet loss probability with varying numbers of LSPs following the same path at the 
fiber level. 



# of triplets of   # of couples of   G        Packet loss
LSPs grouped       LSPs grouped               probability

0                  0                 0        8.395e-04
1                  9                 0.3125   4.837e-04
6                  5                 0.5139   3.395e-04
10                 3                 0.7083   1.482e-04
14                 2                 0.9306   2.801e-05
16                 0                 1        0



To further support these ideas, in figure 2 the output grouping index G' and the packet
loss probability are plotted as a function of the simulation time. The two
curves show complementary behavior: the packet loss probability
increases when the grouping index decreases, and consequently it exhibits a local
minimum (or maximum) when the grouping index is at a maximum (or minimum).



5. Algorithm for Grouping Exploitation 

The previous results show that the RRWS algorithm can significantly improve the
switch performance. The price to pay is an update of the LSP forwarding tables any
time a congestion event occurs and a wavelength hop is performed. As outlined
in section 2, this may result in a significant increase in the complexity of the network.
A possible way around this problem is to exploit as much as possible the
grouping of LSPs, which is not taken into account by the RRWS algorithm.

Here we propose an extension of the RRWS algorithm that takes into account the
grouping configuration inside the switch. The new algorithm is called Grouping
Wavelength Selection (GWS). It is based on the assumption that connection
assignment to wavelengths is performed trying to keep the connections of a given
group together. The aim is to reduce the number of connection re-assignments to
wavelengths.

The algorithm performs as follows: 

• a wavelength is assigned to each logical connection at call set up, choosing the one that
optimizes the output grouping index;

• connections that are grouped with grouping index equal to 1 are never moved;

• any time a packet arrives at queue (i,j) and finds it congested, another wavelength is
searched among the whole set, looking for the one with the minimum number of connections
already assigned and the minimum value of the grouping index. This avoids moving
connections to wavelengths with a high grouping index, which would cause it to decrease (in
particular no connections are added when groups with grouping index equal to 1 are
present);

• once a non-congested wavelength l is found, the LSP is assigned to (i,l) by
updating the forwarding table so that the packets of the LSP use the new
wavelength l; the number of connections and the grouping index for wavelength l
are updated.
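
The selection step of the list above might be sketched as follows. This is a hedged illustration, not the authors' implementation: the per-wavelength statistics dictionary, the `congested` predicate and the tie-breaking order (fewest connections first, then lowest grouping index, then lowest wavelength number) are all assumptions, since the text does not specify how the two criteria are combined.

```python
def gws_select(lsp_g, i, j, stats, congested):
    """Pick a new wavelength on output fiber i for an LSP currently on (i, j).

    lsp_g:     grouping parameter of the LSP (1.0 means fully grouped)
    stats:     dict, wavelength k -> {"lsps": int, "g": float} (hypothetical)
    congested: hypothetical predicate congested(i, k) -> bool
    Returns the chosen wavelength, or None if the LSP must stay where it is.
    """
    if lsp_g == 1.0:
        return None  # connections with grouping index 1 are never moved
    candidates = [
        k for k, s in stats.items()
        # skip the current wavelength, fully grouped ones and congested ones
        if k != j and s["g"] < 1.0 and not congested(i, k)
    ]
    if not candidates:
        return None
    # Assumed ordering: fewest connections, then lowest grouping index.
    return min(candidates, key=lambda k: (stats[k]["lsps"], stats[k]["g"], k))
```

After a successful selection the caller would still have to update the forwarding table and recompute the connection count and grouping index of the chosen wavelength, as the last step of the list requires.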

Figure 3 shows the percentage of re-assignments of LSPs to a new wavelength because
of congestion for the RRWS and GWS algorithms. The percentage is calculated in
terms of number of packets, that is, the figure plots the percentage of packets that hop
from one wavelength to another. This percentage is around 30% for the RRWS, which is a
fairly large number. GWS significantly reduces the number of re-assignments, by up to
almost half in the case of optimal values of D. This is the result of the wavelength
assignment performed for groups on the output link, which maximizes the output grouping
index, and of the wavelength re-assignment policy, which tends to keep the grouping
index high, thus reducing the average rate of congestion events.




Fig. 3. Comparison of the percentages of connection re-assignment for the RRWS and GWS 
for N=4, W=16, M=3, G=0.3125, B = 16, average load per wavelength 0.8. 

Figure 4 compares the packet loss probability for the RRWS and GWS, showing 
that the saving outlined in figure 3 is obtained while also maintaining practically the 
same packet loss performance. 



6. Conclusions 

In this paper we have shown that, in the presence of MPLS connections (LSPs) over a 
DWDM optical packet switched network, the wavelength domain can be used to 
improve performance. In particular dynamic re-assignment of connections to 
wavelengths has been considered, which exploits connection topology properties. The 
results showed that with the application of limited contention algorithms, dynamic re- 
assignment is particularly effective in relation to static mapping, being very close to a 
connectionless environment. An algorithm that exploits the grouping properties has 
also been defined to reduce the number of connections re-assigned thus simplifying 
switching functions. 

Fig. 4. Comparison of packet loss probability for the RRWS and GWS for N=4, W=16, M=3,
G=0.3125, B = 16, average load per wavelength 0.8. 



Abstract. DAVID (Data And Voice Integration over D-WDM) is a research
project sponsored by the European Community aimed at the design of an optical
packet-switched network for the transport of IP traffic. The network has a
two-level hierarchical structure, with a backbone of optical packet routers
interconnected in a mesh, and metropolitan areas served by sets of optical rings
connected to the backbone through devices called Hubs. The paper focuses on
node and Hub architecture, and on the operation of the medium access protocol
to be used in the DAVID metropolitan area network. A simple access protocol
for datagram (not-guaranteed) traffic is defined and its performance is
examined by simulation.



1 Introduction 

The DAVID (Data And Voice Integration over D-WDM) project is part of the IST
(Information Society Technology) Program sponsored by the European Community.
Its aim is the design of an optical packet-switched network for the transport of IP 
traffic over metropolitan, national and international distances. 

The DAVID network is designed to offer an optical transport format independent 
of the traffic type; the clients of the DAVID network are mainly IP routers and/or 
switches that collect traffic from legacy networks. The network is based on a hierar- 
chical architecture consisting of several metropolitan area networks, named DAVID 
Metro networks, interconnected by a wide area optical backbone. We focus on the 
DAVID Metro network in this paper. 

The DAVID Metro Network consists of several uni-directional slotted optical 
physical rings interconnected in a star topology by a Hub. No optical buffering is 
required in the Metro; all the buffering is done in electronics at access nodes. The 
Hub functionality is ring interconnection; since the Hub is buffer-less, it behaves 
basically as a space switch. Ring interconnections are dynamically modified at the 
Hub following a scheduling algorithm. The aim of the scheduling algorithm is to 
provide an amount of bandwidth to ring pairs close to instantaneous (short-term) 
bandwidth requirements. The scheduling is based both on measurements at the Hub 
and on congestion signals issued by nodes. A WDMA/TDMA based MAC (Medium 
Access Control) protocol is defined to regulate access to shared network resources. A
fairness protocol is proposed to guarantee throughput fairness among nodes on each 
ring. 

The remainder of the paper is organized as follows. In Section 2 we give an over- 
view of the DAVID network architecture. In Section 3 we focus on the metropolitan 
network describing both the node and the Hub architecture. In Section 4 the MAC 
protocol and the scheduling algorithm at the Hub are described. In Section 5 we pre- 
sent some preliminary simulation results to assess the performance of the proposed 
scheme. We conclude the paper in Section 6, where we describe future research direc- 
tions. 



2 Network Architecture 

An overview of the two-level DAVID network architecture is shown in Fig. 1: several
Metro networks are interconnected by a wide area network (WAN) backbone. Both 
network parts operate in packet switched mode. The backbone network consists of 
optical packet routers interconnected by a mesh network, while each Metro network 
comprises one or more rings interconnected through a Hub. Each ring collects traffic 
from several nodes and each Hub is connected to an optical packet router in the 
WAN. Access points to the network are provided both in the Metro network and in 
the WAN, and the traffic is collected by IP routers and switches connected to local 
area networks (LANs). 




Fig. 1. General overview of the DAVID network 

The network uses a mixed WDMA/TDMA access protocol: each fiber carries up to 
32 wavelength channels at 2.5 or 10 Gbit/s and time is divided into fixed size slots, 
each carrying an optical packet which consists of a header and a payload. 

In packet switched networks buffering inside routers is needed to solve contentions 
arising among packets arriving in a given node and headed to the same output port. In 
the DAVID WAN, optical packet routers provide buffering in the optical domain by
means of fibre delay lines. 

No packet buffering in the optical domain is instead performed for packets flowing 
among ring nodes in the same Metro network. In a similar way, optical buffering is 
completely avoided along the node-to-Hub path for traffic exchanged among Metro 
nodes and nodes outside the Metro. Indeed, packets are buffered in ring nodes in the 
electrical domain, and are sent on the Metro network only when there are enough free 
resources on the Metro to travel from source to destination without being stored at 
any intermediate node. Thus, buffers are pushed towards the edge of the Metro net- 
work and sharing of rings resources among nodes must be regulated by a properly 
designed MAC protocol. 

The interfaces between WAN and Metro network are critical points where conten- 
tions involving traffic flowing between the backbone and the Metro networks might 
arise. This is worsened by the fact that optical packets could need either format or bit- 
rate translation (or, eventually, both) while travelling up and down the network hier- 
archy. Therefore, buffering and translation functions are implemented in Gateways 
placed between optical packet routers and Hubs (see Fig. 1).

In DAVID, the Hub has connection points to the Metro rings and towards a WAN 
Optical Packet Router through a Gateway. A certain number of Hub ports (wave- 
lengths) are devoted to connections towards the Gateway. The remaining Hub ports 
connect the Hub to the optical packet rings of the MAN. Since the Hub is buffer-less, 
as described later, it performs space switching and wavelength conversion only. Opti- 
cal/electrical memories are present in the Gateway to solve contentions in the time 
domain for optical packets going from Metro network to WAN and vice versa. More- 
over, the Gateway will participate in the MAC protocol, such that, from a logical 
point of view, the connections from and to the Gateway appear to the Hub as addi- 
tional Metro ring connections. 

We will focus on the DAVID Metro network in the remainder of the paper. 



3 Metro Network 

In general, a DAVID Metro Network consists of several uni-directional optical physi- 
cal rings interconnected in a star topology by a Hub. On each fibre, a fixed number of 
wavelengths is available by WDM partitioning. Logical rings can either be physically 
disjoint (i.e., run on different fibres), or be obtained by partitioning the optical band- 
width of one fibre into disjoint portions. Nodes belonging to the same logical ring 
access the same set of shared resources. Recall that one logical ring may represent the 
WAN/MAN gateway functionality. In the remainder of the paper we use the term ring 
to identify a logical ring; any reference to physical rings will be explicit. The number 
of rings in a Metro network is denoted by Af 

While the number of wavelengths on each ring can be in general different, we as- 
sume that it is a multiple of the same number. In DAVID demonstrators this is dic- 
tated by technological constraints, since SOA arrays are used at each ring node to 
select the wavelengths from/to which packets are received/transmitted. Up to 32 
wavelengths are available on each physical ring (fibre), and all wavelengths run at
either 2.5 or 10 Gbit/s. We also assume that all the nodes of a ring can transmit and 
receive on any wavelength used in that ring. The latter is a rather essential assump- 
tion, since the access scheme would be much more complex if nodes could have a 
limited tunability on the wavelengths of the ring they belong to. In particular, in this 
paper we assume for simplicity that the same number of wavelengths (A(,^^=4 wave- 
lengths) is available on any ring. 

Ring resources are shared by the nodes of the Metro network using a statistical 
time/wavelength/space division scheme. Indeed, 

• each wavelength is time slotted (TDM) and the slot duration is about 500 ns, 

• several slots are simultaneously transmitted through wavelength division (WDM), 

• rings can be disjoint in space (SDM). 

Thus, resource sharing is based on a WDMA/TDMA scheme, i.e. a combination of 
Wavelength Division Multiple Access and Time Division Multiple Access. 

Time slots are aligned on all wavelengths of the same ring, so that a multi-slot (a 
slot in each wavelength) is available to each node in each time slot. Slot alignment 
among different Metro rings is dealt with at the Hub; we assume for simplicity that 
the propagation delay on each ring is an integer multiple of the slot size. One of the 
wavelengths (hence a slot in each multi-slot) is devoted to management and network 
control purposes. We assume that this control slot can be read and written by all 
nodes independently of their data transmissions and receptions in other slots of the 
multi-slot. The control information contained in a multi-slot refers to data slot in the 
same multi-slot; thus, a delay is added in each node to process information contained 
in the control slot. Wavelengths are (dynamically) assigned to ring-to-ring communi- 
cations by the Hub on a time-slot basis: all the wavelengths in the multi-slot are de- 
voted to transmissions to a given destination ring, identified with a label in the control 
slot. Any wavelength in the multi-slot may be used by a ring node to reach any node 
in the destination ring. 

Metro ring nodes are subject to collisions and receiver contentions. By collision, 
we mean multiple transmissions in the same time slot, the same wavelength and the 
same physical ring. By receiver contention we mean having in the same multi-slot 
and the same ring a number of packets (in different wavelengths) to be received by a 
given node larger than the number of receivers available at that node. 

Both collisions and contentions are avoided at each source node thanks to the 
MAC protocol, by monitoring the state of the incoming multi-slot, and giving priority 
to in-transit traffic. To avoid collisions, no new packet can be transmitted on a busy 
channel; to avoid contentions, if the number of packets in the current multi-slot for a 
given destination exceeds its capacity (i.e. number of receivers), no new packet can 
be transmitted to that destination. 

It is important to observe that contentions may arise also at the Hub; contentions 
are avoided by defining the Hub as a space switch and by running a proper slot 
scheduling algorithm. 



3.1 Ring Node Architecture

We assume that the number K of transceivers at each Metro ring node is smaller than 
the number of WDM channels; this means that a node can only transmit and receive 
on at most K channels at the same time, i.e. in each multi-slot. We typically consider 
the case K=1; thus, each node has a single tunable transceiver: tuning actions are
executed before transmitting and receiving independently at the transmitter and the 
receiver. We also assume that all the nodes of a ring can transmit and receive on any 
WDM channel used in the ring they belong to. 

The board of a ring node is basically composed of two parts: an optical part and an 
electronic one. For the optical part, the ring node can drop, add and erase any packet 
on any wavelength at each time slot; switching is forbidden for in-transit traffic, in 
the sense that no operation is allowed on data not addressed to the node. Data are 
taken off the ring when they arrive at their destination. 

The electronic part is composed of the following portions: a segmentation (reassembly)
stage to create fixed size data units from variable size packets (and vice versa), a
queuing stage, in which packets are grouped and stored per destination ring to avoid 
HoL (Head of the Line) blocking [1], and a load balancing stage, to distribute the 
packets evenly over the available wavelengths. The HoL blocking is typical of FIFO 
queues: a packet at the head of the FIFO queue that cannot be transmitted to avoid 
collisions or contentions on the ring may prevent a successful transmission of another 
packet following in the FIFO order. Note that this queue architecture is very similar to 
the VOQ (Virtual Output Queue) architecture used in IQ (Input Queued) switches [2], 
where, at each input port, packets are stored in separate queues on the basis of the 
destination port they should reach. 

Since resources (multi-slots and wavelengths) in DAVID are allocated to ring-to- 
ring communication, queues are organized per ring destination, i.e., at each node a 
FIFO queue is available to store packets directed to all the nodes belonging to a given 
ring. This avoids HoL blocking due to collision avoidance (since multi-slots are asso- 
ciated with destination rings), but does not solve HoL blocking due to receiver con- 
tentions, which would require a per-destination-node queuing scheme. The consid- 
ered per-destination-ring queuing is however simpler to implement and to control, 
and scales much better to large network configurations. 
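
The per-destination-ring queuing just described can be illustrated with a small Python sketch. The class and method names are hypothetical and the code is not taken from the DAVID demonstrators; it only shows why a packet blocked towards one ring does not hold back packets headed to another ring.

```python
from collections import deque


class RingNodeQueues:
    """One FIFO queue per destination ring (hypothetical helper class)."""

    def __init__(self, n_rings):
        self.queues = [deque() for _ in range(n_rings)]

    def enqueue(self, packet, dest_ring):
        """Store a packet in the queue of its destination ring."""
        self.queues[dest_ring].append(packet)

    def head_for(self, dest_ring):
        """Head-of-line packet for dest_ring, or None if the queue is empty."""
        queue = self.queues[dest_ring]
        return queue[0] if queue else None

    def dequeue(self, dest_ring):
        """Remove and return the head-of-line packet for dest_ring."""
        return self.queues[dest_ring].popleft()
```

When a multi-slot labeled for a given ring arrives, the node only inspects the queue of that ring. Packets for different nodes of the same ring still share one FIFO, so blocking due to receiver contention remains possible, as noted in the text.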

The electronic interface is used also to solve the contention problem by running the 
MAC protocol and to drive the packet insertion on the ring in a free slot. 



3.2 Hub Architecture 

The role of the Hub is to switch packets between Metro rings, and from Metro rings 
towards the WAN (and vice-versa). Being all-optical, the Hub includes only a space 
switching stage, a wavelength conversion stage, and a WDM synchronisation stage; 
3R regeneration may be added if necessary. Note that the target switching capacity in 
DAVID, given that in a typical Metro network M„g=4 rings running 32 wavelengths at 
10 Gbit/s are envisioned, is 1.28 Tbit/s. 

In every time slot, the Hub operates a permutation from input rings to output rings,
as depicted in Fig. 2 for the case of four rings. This permutation is the same for all 
wavelengths of each ring and is known for each time slot in each ring: we can assume 
that each multi-slot is labeled by the Hub with the identity of the ring to which pack- 
ets transmitted in the multi-slot will be forwarded by the Hub. 



[Figure: input rings 1 to 4 permuted onto output rings 1 to 4]



Fig. 2. A ring-to-ring permutation at the Hub 



Since we are assuming that the number of wavelengths in each ring is the same, no 
congestion occurs at the Hub: each incoming multi-slot can be forwarded to Hub 
outputs. The Hub must act as a non-blocking switch that is re-configured in every 
time slot. It does not have to operate in the time domain, but it may have to perform 
wavelength conversion when the wavelengths used in the input ring are different 
from those used in the output ring (this always happens when the two rings are ob- 
tained in wavelength division on the same fibre). 
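
The per-slot behaviour of the Hub described above amounts to applying a permutation to the incoming multi-slots. The following sketch is an assumption-laden illustration: rings are numbered from 0, `multislots[r]` stands for whatever arrives from ring r in the current time slot, and wavelength conversion is not modeled.

```python
def hub_forward(multislots, permutation):
    """Forward one multi-slot per input ring according to a permutation.

    multislots[r]:  multi-slot arriving from input ring r in this time slot
    permutation[r]: output ring to which input ring r is connected
    Returns a list indexed by output ring.
    """
    n_rings = len(multislots)
    if sorted(permutation) != list(range(n_rings)):
        raise ValueError("each output ring must be used exactly once")
    out = [None] * n_rings
    for ring_in, ring_out in enumerate(permutation):
        out[ring_out] = multislots[ring_in]
    return out
```

Since every output ring receives exactly one multi-slot, nothing has to be stored at the Hub. The identity permutation, in which a ring is connected to itself, is a valid input.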




Fig. 3. Scheduling at the Hub 

The computation of the sequence of permutations operated by the Hub is a sched- 
uling problem [3, 4], as shown in Fig. 3. Several approaches can be envisaged to 
solve this problem, ranging from complex optimisations to simple heuristics, and are 
based on an estimation of the ring-to-ring traffic pattern (note that the complexity of
the scheduling problem depends on the number of rings, not on the number of nodes: 
this allows good scalability features). The scheduling algorithm is described in Sec- 
tion 4.4. 

Given this Hub behaviour, each multi-slot traverses a sequence of rings, e.g. as
illustrated in Fig. 4, where Roman numerals indicate successive positions of the
multi-slot, the upper slot is the control slot where the multi-slot destination ring is written,
and numbers within the multi-slot represent node destinations. Nodes of ring x
transmit data to be received by nodes of ring y (Steps II to IV). Ring x can be viewed as
the “upstream” ring, where transmissions occur, while ring y can be viewed as the 
“downstream” ring, where receptions occur. Note however that when the considered 
multi-slot traverses the downstream ring y (Steps VI to VIII), it gathers transmissions 
for the next ring, say ring z, so that the traversal of a ring can be viewed as a down- 
stream path for transmissions done in the previous ring, and as an upstream path for 
receptions in the following ring. 

Space reuse of slots is possible in the DAVID Metro: a node receiving a packet 
leaves free the corresponding slot, which can be reused in the same ring, possibly by 
the same receiving node, for another transmission (see the transmission from node 1 
to node 3 in Step I of Fig. 4). This also means that, in the example above, transmis- 
sions on upstream ring x can also be directed to other nodes of ring x (in addition to 
transmissions to nodes of downstream ring y). Note that transmissions to destinations 
belonging to the same ring of the source node must go through the Hub when the 
destination precedes the source in the ring, hence Hub permutations in which the 
input and the output ring are the same are possible and required. 




Fig. 4. Multi-slot forwarding in the MAN. Numbers in slots represent packet destinations




Access Control Protocols for Interconnected WDM Rings in the DAVID Metro Network



We inhibit these slot reuse capabilities in our simulation experiments, and force all 
traffic to pass through the Hub before being removed from the ring. 



4 MAC Protocol and Scheduling at the Hub

In this section, we first describe the contention and collision problem in a DAVID 
Metro. Then, a simple access control scheme is proposed, and a fairness control is 
introduced to overcome the unfair behaviour of ring architectures. Finally, the 
scheduling algorithm at the Hub is discussed in detail. 



4.1 Contention and Collision Resolution 

Receiver contentions are not recoverable (packets would be lost), unless very com- 
plex receiver architectures are used. The proposed approach to solve contentions and 
collisions avoids packet losses in the path from the source node to the destination 
node, and is presented in the sequel. It is mainly achieved by the nodes, so that the 
operations and the implementation of the Hub are drastically simplified. In particular, 
no packet buffering, nor packet switching in the time domain, is required at the Hub. 



4.2 The Access Control Scheme 

In the description of the access control scheme, we assume for simplicity that the 
number of wavelengths supported on each ring is the same. 

The choice of a ring for the DAVID Metro network significantly impacts the un- 
derlying framework in which the MAC protocol operates. Although the generic solu- 
tions befitting switches with VOQ architecture can be adapted to the ring topology, 
the nature of the ring, where the signal has to pass through all nodes taking a round 
trip time for the collection of reservations, makes token based solutions more advan- 
tageous for this environment. 

The status of each slot of the multi-slot is reflected in suitable fields of the control
slot. Each node that has packets to send must monitor the control wavelength, seeking
an empty slot in any λ of a multi-slot that will be forwarded by the Hub to the
corresponding destination ring. The node grabs the slot by setting the corresponding slot
status field and adding the destination address in the relevant field. Before grabbing
the slot, the node must check that the intended destination does not already appear in
as many other λs as the number of available tunable receivers (K=1 in this paper), in
which case it refrains from taking this slot and waits for the next opportunity.

Ring nodes also monitor the control wavelength looking for any instance of their
address, in which case they tune to the indicated λ to receive the data contained in the
corresponding slot. Again we assume at each node a delay in processing multi-slots
larger than the tuning time required to set the receiver to the proper λ.

In summary, receiver contentions are solved assuming that the source node knows
how many receivers are available at the destination node: transmission of a packet is 
forbidden if the number of packets sent by upstream nodes in the current multi-slot to 
the destination exceeds the reception capacity. To avoid collisions, an empty-slot 
protocol is used: incoming slots are inspected, and transmission is permitted only if 
the slot in some wavelength is free, i.e., no upstream node transmitted in that slot and 
that wavelength. Note that this gives some advantage to upstream nodes, i.e., to nodes 
preceding others along the signal propagation direction: a given node can be com- 
pletely starved by continuous transmissions of upstream nodes. This raises fairness 
issues, so that a protocol that provides fairness control is needed. 
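
The access rule summarised above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the control slot is modelled as a list with one entry per data wavelength (`None` for a free slot, otherwise the destination address written by an upstream node), and the names `pick_wavelength`, `grab` and `k_receivers` are hypothetical. The check that the multi-slot is labelled for the right destination ring is assumed to be done by the caller.

```python
from typing import List, Optional


def pick_wavelength(control_slot: List[Optional[int]],
                    dest: int,
                    k_receivers: int = 1) -> Optional[int]:
    """Return the index of a wavelength the node may use, or None.

    control_slot[w] is None when the slot on wavelength w is free,
    otherwise it holds the destination address set by an upstream node.
    """
    # Receiver contention: the destination must not already appear in as
    # many slots of this multi-slot as it has tunable receivers.
    already_addressed = sum(1 for d in control_slot if d == dest)
    if already_addressed >= k_receivers:
        return None
    # Collision avoidance (empty-slot protocol): use the first free slot.
    for w, d in enumerate(control_slot):
        if d is None:
            return w
    return None


def grab(control_slot: List[Optional[int]], dest: int,
         k_receivers: int = 1) -> bool:
    """Write dest into a free slot if the access rule allows it."""
    w = pick_wavelength(control_slot, dest, k_receivers)
    if w is None:
        return False
    control_slot[w] = dest
    return True
```

With a single receiver per node, a node that wants to reach destination 3 must skip a multi-slot in which an upstream node has already addressed destination 3, even if free slots remain on other wavelengths.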



4.3 Fairness Control 

As noted above, the proposed empty-slot operation can exhibit fairness problems 
under unbalanced traffic; this is particularly true in the ring topology, in which up- 
stream nodes have generally better access chances than downstream nodes. 
Credit-based schemes, such as the Multi-MetaRing [5] previously studied in the context
of a single ring, can enforce throughput fairness. MetaRing [6] was proposed by Y.
Ofek for ring-based, electronic metropolitan area networks. It is basically a generali- 
sation of the token-ring technique: a control signal or message, called SAT, is circu- 
lated in store-and-forward mode from node to node along the ring. A node forwarding 
the SAT is granted a transmission quota: the node can transmit up to Q packets before 
the next SAT reception. When a node receives the SAT, it immediately forwards the 
SAT to the next node on the ring if it is satisfied (hence the name SAT), i.e. if 

• no packets are waiting for transmission on the ring, or 

• Q packets were transmitted since the previous SAT reception. 

If the node is not satisfied, the SAT is kept at the node until one of the two conditions
above is met. Thus, SATs are delayed by nodes suffering throughput limitations, and
SAT rotation times increase with the network load. To be able to provide the full
bandwidth to a single node, the quota Q must be at least equal to the number of data 
slots contained in the ring, i.e., proportional to the ring latency (propagation delay) 
measured in slot times. In overload, each node sends exactly Q packets per SAT rota- 
tion time. 
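
The SAT handling at a single node can be sketched as below. This is a hedged illustration of the rule described above, not code from the MetaRing literature: the class name `MetaRingNode` and its fields are hypothetical, and the sketch assumes that the transmission counter is reset at the moment the SAT is forwarded, which is when the node is granted a new quota.

```python
from dataclasses import dataclass


@dataclass
class MetaRingNode:
    quota: int                 # Q: packets allowed per SAT rotation
    backlog: int = 0           # packets waiting for transmission
    sent_since_sat: int = 0    # packets sent since the SAT was last forwarded
    holding_sat: bool = False  # True while the SAT is kept at this node

    def satisfied(self) -> bool:
        # Satisfied if nothing is waiting or the quota has been used up.
        return self.backlog == 0 or self.sent_since_sat >= self.quota

    def transmit(self) -> bool:
        """Send one packet in a free slot if the quota allows it."""
        if self.backlog == 0 or self.sent_since_sat >= self.quota:
            return False
        self.backlog -= 1
        self.sent_since_sat += 1
        return True

    def receive_sat(self) -> None:
        self.holding_sat = True

    def try_forward_sat(self) -> bool:
        """Forward the SAT if satisfied; forwarding renews the quota."""
        if self.holding_sat and self.satisfied():
            self.holding_sat = False
            self.sent_since_sat = 0
            return True
        return False
```

A node that still has backlog and unused quota keeps the SAT, which is how starved nodes slow down the SAT rotation and throttle their upstream neighbours.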

In the case of the DAVID MAN, several rings exist, and multi-slots traverse pairs 
of rings. We therefore need a SAT for each ring pair (upstream ring, downstream 
ring). SAT signals can be carried in the multi-slot control wavelength. 

The Hub must be able to store $N_{ring}^2$ SATs, where $N_{ring}$ is the number of rings
attached to the Hub. Since SATs do not carry any information, boolean variables
$SAT_{i,j}$ do the job; $SAT_{i,j}$ is TRUE when the SAT regulating transmissions from ring i
to ring j is at the Hub. When the Hub issues on ring i a multi-slot that will be
switched, upon return to the Hub, to ring j, if the SAT i→j is currently at the Hub
(i.e., if $SAT_{i,j}$=TRUE), the SAT is loaded in the control slot of the multi-slot, by
setting a suitable bit, and by setting $SAT_{i,j}$ to FALSE.







Each node inspects the control slot of incoming multi-slots, and operates on SATs 
as described above for the single ring case. Recall that each queue is regulated by a 
different SAT and transmission opportunities are regulated by a MetaRing quota Q 
that may be different for each queue; however, the quota Q must be greater than or equal
to the ring latency to allow a single node to grab all the available bandwidth; thus, 
since in this paper we assume that all ring latencies are equal, we use the same value 
of the quota Q for all queues. 

SATs are also used to trigger congestion notification signals from ring nodes to the
Hub. This information is used by the Hub to determine the scheduling in successive 
frames as described later. 



4.4 A Simple Scheduling Algorithm 



We describe the approach followed to compute the scheduling at the Hub; the algo- 
rithm is run in a centralised fashion at the Hub. Multi-slots are labelled at the Hub 
according to the outcome of the scheduling algorithm, using the control slot to iden- 
tify the ring to which the multi-slot will be forwarded upon return to the Hub. Only 
unicast transmissions are considered, i.e. multicast transmissions are treated as
multiple unicast transmissions.

The Hub scheduler is driven by an $N_{ring} \times N_{ring}$ request matrix R. Each element
$R_{i,o}$ in R contains the number of multi-slots that must be transmitted from input ring i to
output ring o, i.e., the number of multi-slots labelled with o in the control channel that
the Hub must send on ring i, and, upon arrival at the Hub, switch to ring o. This
request matrix is obtained by mixing periodic measurements and congestion signals
issued by nodes as described in Section 4.4.1.

According to combinatorial theory [7], R can be scheduled in at most F time slots,
where the frame length F is equal to:

$$F = \max\left( \max_i \sum_o R_{i,o},\ \max_o \sum_i R_{i,o} \right)$$

by using a sequence of F switching matrices P(i), i ∈ {1, 2, ..., F}, of size
$N_{ring} \times N_{ring}$. A switching matrix is a binary matrix whose element $P_{i,o}$ is 1 when
input ring i is connected to output ring o, and 0 otherwise. The resulting scheduling in F
is then repeated an integer number of times, until a new value for R becomes available and a
new matrix decomposition can be computed. Traffic flows from ring i to ring o are
served with a rate proportional to $R_{i,o}/F$.
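
As a small worked illustration, the frame length can be computed from a request matrix as shown below. The sketch assumes that F is the largest row or column sum of R, which is the standard bound for decomposing a non-negative integer matrix into switching matrices; the function names and the 3-ring example matrix are hypothetical.

```python
from typing import List


def frame_length(R: List[List[int]]) -> int:
    """Largest row sum or column sum of the request matrix R."""
    row_max = max(sum(row) for row in R)
    col_max = max(sum(col) for col in zip(*R))
    return max(row_max, col_max)


def service_share(R: List[List[int]], i: int, o: int) -> float:
    """Fraction of the frame in which ring i is switched to ring o."""
    return R[i][o] / frame_length(R)


# Hypothetical request matrix for three rings (illustrative values only).
R_example = [[3, 1, 0],
             [0, 2, 2],
             [1, 0, 1]]
```

For `R_example` the row sums are 4, 4 and 2 and the column sums are 4, 3 and 3, so the frame needs 4 slots and the relation from ring 0 to ring 0 is served in 3 of them.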

Since each input ring can be connected to at most one output ring and each output 
ring can be connected to at most one input ring in each time slot, a switching matrix 
always contains at most one non-null element in each row and in each column. Thus, 
the sum of each row and column in P is either equal to 0 or 1 . Each switching matrix 
represents the Hub switching configuration in a given time slot; recall that we need to 
obtain a set of F ring permutations as the outcome of the Hub scheduling algorithm. 
Thus, in each time slot one and only one element from each row and one and only 




48 A. Bianco et al. 



one element from each column must be equal to 1 in P. In other words, we are interested
in doubly stochastic switching matrices, i.e. matrices P such that

$$\sum_i P_{i,o} = 1 \ \ \forall o, \qquad \sum_o P_{i,o} = 1 \ \ \forall i$$

The outcome of the scheduling algorithm is a sequence of F doubly stochastic
switching matrices; this scheduling satisfies a matrix where each row and each
column sum to F, a condition that in general does not hold for R. We artificially add
integer quantities, representing ring-to-ring multi-slot requests, to some elements in
the original matrix R, to obtain the matrix R* to be scheduled. Any algorithm can be
used to obtain a matrix R* satisfying this condition; see e.g. [8].
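
One simple way to build such a padded matrix is sketched below. It is only an example of "any algorithm": it is not claimed to be the method of [8]. It assumes F is the largest row or column sum of R and fills the row and column deficits greedily, in the style of the north-west corner rule; the name `pad_to_frame` is hypothetical.

```python
from typing import List


def pad_to_frame(R: List[List[int]]) -> List[List[int]]:
    """Return a copy of R padded so that every row and column sums to F."""
    n = len(R)
    F = max(max(sum(r) for r in R), max(sum(c) for c in zip(*R)))
    padded = [row[:] for row in R]
    row_def = [F - sum(r) for r in padded]        # missing requests per row
    col_def = [F - sum(c) for c in zip(*padded)]  # missing requests per column
    i = o = 0
    # Total row deficit equals total column deficit, so both pointers
    # reach the end with every deficit filled.
    while i < n and o < n:
        add = min(row_def[i], col_def[o])
        padded[i][o] += add
        row_def[i] -= add
        col_def[o] -= add
        row_done = row_def[i] == 0
        col_done = col_def[o] == 0
        if row_done:
            i += 1
        if col_done:
            o += 1
    return padded
```

The padding only adds requests, so every original ring-to-ring request is still served; the extra slots simply give some ring pairs more bandwidth than they asked for.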

The matrix R may be associated with a bipartite graph G having $2N_{ring}$ nodes. Each
node represents either one input or one output of the switch, and input node i is
connected to output node j by one edge only if $R_{i,j} > 0$. A matching on G is a subset E of
the edges in G such that each node in G is incident to at most one edge in E. The
number of edges in E is the size of the matching, and a matching is said to be
maximum when it has maximum size.

We may apply a maximum size matching algorithm [9] on R* to obtain the Hub scheduling,
i.e., a sequence of doubly stochastic P(i), i ∈ {1, 2, ..., F}.

Another possible algorithm that may be used is a critical maximum matching on R.
Any input i for which

$$\sum_o R_{i,o} = F$$

and any output o for which

$$\sum_i R_{i,o} = F$$

is said to be critical, since it must be served in every time slot if R must be scheduled
in F slots. The request matrix R is decomposed into F switching matrices through
iterated application of the critical maximum matching algorithm [4]. A critical
maximum matching is a maximum matching which covers all the critical input and output
nodes.

At step i, the decomposition algorithm computes the switching matrix P(i) as a
critical maximum matching on R. When the matching has size lower than $N_{ring}$, matrix
P(i) is completed so that all input rings are always connected to all output rings.
Finally, P(i) is subtracted from R, and a new iteration is started.

At the end, the matrices P(i) are randomly shuffled to uniformly distribute ring to 
ring pairs on the F time slots of the frame, to reduce traffic burstiness. 
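
The decomposition into a shuffled frame of permutations can be sketched as follows. Note the substitution: this sketch does not implement the critical maximum matching of [4]; it takes the simpler route of extracting a perfect matching (with a plain augmenting-path search) from a matrix whose rows and columns already all sum to F, which is the padded-matrix option mentioned earlier. Function names and the fixed `seed` are hypothetical.

```python
import random
from typing import List, Optional


def perfect_matching(M: List[List[int]]) -> List[int]:
    """Return match[i] = o with M[i][o] > 0, one distinct o per input i."""
    n = len(M)
    owner: List[Optional[int]] = [None] * n  # owner[o] = input matched to o

    def augment(i: int, seen: List[bool]) -> bool:
        for o in range(n):
            if M[i][o] > 0 and not seen[o]:
                seen[o] = True
                if owner[o] is None or augment(owner[o], seen):
                    owner[o] = i
                    return True
        return False

    for i in range(n):
        if not augment(i, [False] * n):
            raise ValueError("row and column sums of M are not all equal")
    match = [0] * n
    for o, i in enumerate(owner):
        match[i] = o
    return match


def decompose(R_star: List[List[int]], seed: int = 0) -> List[List[int]]:
    """Split R_star into F permutations, each given as match[i] = o."""
    M = [row[:] for row in R_star]
    F = sum(M[0])
    frame = []
    for _ in range(F):
        match = perfect_matching(M)
        for i, o in enumerate(match):
            M[i][o] -= 1          # one multi-slot of this ring pair is served
        frame.append(match)
    random.Random(seed).shuffle(frame)  # spread ring pairs over the frame
    return frame
```

Each entry of the returned frame is one Hub configuration for one time slot; counting how often ring i is switched to ring o over the frame gives back the scheduled matrix.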

4.4.1 Traffic Measurement 

The request matrix R used by the scheduling algorithm is estimated on the basis of 
traffic measurements performed at the Hub during consecutive observation windows 
(OW); the duration of each OW is fixed and roughly equal to 10 ring propagation
times. 

The key idea of the algorithm is that, as long as the network is not overloaded, the 
throughput is a good estimator of the offered load. When one or more traffic relations 
among different ring pairs become overloaded, congestion control mechanisms are 
introduced to modify the bandwidth allocation in the network. Note that overloading 
conditions depend on the scheduling at the Hub. If the scheduling determined at the 
Hub is not matched to the traffic distribution, some rings experience overload
conditions until the scheduling is modified, since the scheduling determines
bandwidth allocation among ring-to-ring pairs.

The matrix R that must be scheduled is computed, at the end of each OW, as the
sum of 3 contributions ($N_{ring} \times N_{ring}$ matrices):

$$R = \lceil SM + \beta\, IC + \gamma\, EC \rceil$$

with β and γ positive constants, where:

• SM (smoothed measure) is a measure of the (long term) average number of multi-slots
  transmitted among ring pairs, where each element is a real number ranging
  between 0 and OW; this is an absolute throughput measure

• IC (implicit congestion) is the fraction of filled slots, where each element is
  represented as a real number between 0 and 1; this is a relative throughput measure

• EC (explicit congestion) takes into account explicit congestion signals sent by ring
  nodes, where each element is either 0 or 1.

The Hub stores in each element of matrix M (measured) the number of packets
flowing from ring i to ring o during each OW. The matrix M is then passed through
an exponential filter to smooth out measurement errors, obtaining matrix SM. Thus, a
new value for SM is computed at the end of each OW as a function of the last
measured matrix M and of the values assumed by SM at the end of the previous OW:

$$SM_n = \alpha\, SM_{n-1} + (1-\alpha)\, M / N_{ch}$$

where α ∈ [0,1] is a constant and $N_{ch}$ is the number of wavelength channels available
on a logical ring, which we assume to be equal for all Metro rings; matrix M is
divided by $N_{ch}$ to convert the number of packets into the number of multi-slots. Therefore,
element $SM_{i,o}$ of SM is the average number of multi-slots transmitted from ring i to ring
o during one OW, roughly averaged over the last 1/(1-α) observation windows.
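
The exponential filter can be sketched in a few lines. The function name `update_sm` and the argument names are hypothetical; `n_ch` stands for the number of wavelength channels per logical ring used to convert packets into multi-slots, and `alpha` is the filter constant applied to the previous smoothed value.

```python
from typing import List


def update_sm(sm_prev: List[List[float]],
              m: List[List[int]],
              alpha: float,
              n_ch: int) -> List[List[float]]:
    """One filter step: SM = alpha * SM_prev + (1 - alpha) * M / n_ch."""
    return [[alpha * s + (1.0 - alpha) * (pkts / n_ch)
             for s, pkts in zip(s_row, m_row)]
            for s_row, m_row in zip(sm_prev, m)]
```

With `alpha` close to 1 the filter reacts slowly: starting from zero and measuring a constant 400 packets per window on 4 channels, the smoothed value moves by one tenth of the remaining gap towards 100 multi-slots at each window when `alpha` is 0.9.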

Matrix IC gives the ring-to-ring connection throughput measured at the Hub, i.e.,
the occupation of scheduled slots. Each element $IC_{i,o}$ is the ratio of the number of
packets sent from ring i to ring o over the number of slots available for transmission
on the same traffic relation in one OW. If $IC_{i,o}$ is close to 1, this is a signal of potential
congestion between i and o.

Matrix EC is a binary matrix which provides information on the ring congestion
level on the basis of node queue lengths. Congestion signals are triggered at nodes by
SAT transmissions. Each node on ring i, when releasing $SAT_{i,o}$, checks the length of
the queue toward ring o; if the queue length exceeds a given threshold, the node sends,
on the control channel, a congestion signal to the Hub. Note that we use the control
channel to send congestion signals to the Hub instead of SAT messages, since SAT messages
may be delayed by downstream nodes experiencing difficulties in channel access. Each
element $EC_{i,o}$ in EC is set to 1 at the Hub if the Hub has received at least one congestion
signal toward ring o from a node on ring i during the last OW. The value of the threshold is
related (equal in our simulation experiments) to the MetaRing quota Q; the rationale
is that the quota represents, for each node, transmission opportunities toward a given
ring between two consecutive SAT receptions at the node. We assume congestion if the
number of packets already in the queue when releasing the SAT is greater than the
MetaRing quota, since the node will not be able to transmit all the packets in the
queue in the following SAT rotation time.

Note that the two congestion signals operate on two different time scales: the first
indication, stored in IC, is related to the observation window, which is fixed; the
second indication, stored in EC, is triggered by SAT arrivals, and depends on the SAT
rotation time, which in turn depends on the number of nodes in the network.
Moreover, the implicit congestion signal can be used as an early congestion
indication, to trigger an increase in slot allocation to a given ring pair without waiting until
the queue size in a node exceeds the threshold.
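
Putting the three contributions together, the request matrix could be assembled as in the sketch below. It assumes, as our reading of the formula, that R is obtained by rounding SM + β·IC + γ·EC up to the next integer element by element; the function name `request_matrix` is hypothetical, and the default weights are the values β=1 and γ=3 quoted for the simulation experiments.

```python
import math
from typing import List


def request_matrix(sm: List[List[float]],
                   ic: List[List[float]],
                   ec: List[List[int]],
                   beta: float = 1.0,
                   gamma: float = 3.0) -> List[List[int]]:
    """Element-wise ceil(SM + beta * IC + gamma * EC)."""
    return [[math.ceil(s + beta * c + gamma * e)
             for s, c, e in zip(s_row, c_row, e_row)]
            for s_row, c_row, e_row in zip(sm, ic, ec)]
```

An explicit congestion signal therefore adds γ multi-slots to the request of a ring pair, while a fully occupied allocation (IC close to 1) adds about β.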

The presented algorithm has some important properties that we want to highlight. 
Suppose that the network is not overloaded, since the scheduling algorithm at the Hub 
provides enough slots (bandwidth) to each ring-to-ring traffic relation. This means 
that the scheduling determines a slot allocation “matched” to the offered traffic, i.e. a 
slot allocation that satisfies all traffic relations, which are never congested. This is the 
solution we would like to obtain with our algorithm under stationary traffic condi- 
tions. Congestion signals are never issued, since nodes do not experience congestion. 
Thus, the frame length is determined by the scheduling on matrix R=SM; the measured
average slot occupation is proportional to OW via the network load ρ, i.e. roughly OW × ρ. All the
simulation results show that if the network is not overloaded, the frame length is close 
to this value. This feature is obtained because the measurement interval is fixed. If we 
had a variable measurement interval proportional to the frame length, we would have 
obtained a shrinking frame length, since each measurement would create a matrix R 
where each element is on average reduced by a factor ρ with respect to the value
assumed in the previous interval. On the other hand, a fixed measurement interval 
raises the problem of deciding a value for such interval, which indirectly decides also 
the granularity in bandwidth allocation and control. Recall that we chose the meas- 
urement interval to be equal to 10 ring latencies in our simulation experiments. 

Finally, we must ensure that the scheduling provides at least one multi-slot for each
ring-to-ring pair, i.e. at least a set of covering permutations must be scheduled in the
frame, so that at least one multi-slot is available in each ring to send packets to any
other ring. Otherwise, if no traffic exists on a given ring-to-ring pair, it is not possible
to measure any slot occupation, the SAT cannot be sent, explicit congestion signals
cannot be raised by nodes, and no implicit congestion signal can be measured at
the node. We force the scheduling to provide this set of covering permutations
in each frame.



5 Simulation Results 

We present some simulation results to assess the performance of the proposed access 
scheme. We do not exploit the space reuse capability described at the end of Section 
3.2: if a multi-slot on ring x is labelled with destination ring z, it is used only to send 
traffic to nodes in ring z. Moreover, we force each multi-slot on ring x labelled with 
destination ring x (intra-ring traffic) to pass through the Hub; this is required to allow
the Hub to perform traffic measurement for all ring pairs. 

In our simulation experiments the Metro network comprises $N_{ring}$=4 rings, with 10
nodes on each ring. For each ring-to-ring communication $N_{ch}$=4 data channels are
available; thus, each multi-slot comprises 5 slots, 4 for data traffic and 1 for control
and management. Each node stores packets in 4 queues, one for each destination ring.
Each queue is 1000 packets long, and the packet size is matched to the slot size.

The values used in our simulation experiments for the parameters defined in the
measurement algorithm are the following: β=1, γ=3, α=0.9. The ring round trip time
is assumed equal to 44 time slots, the MetaRing quota is Q=44 and the congestion
threshold is equal to the quota. The observation window is OW=440 time slots.

We consider two traffic patterns: a uniform traffic pattern and an unbalanced
traffic pattern. Define the weight matrix W, of size $N_{ring} \times N_{ring}$, where each
element $W_{i,o}$ is a real number ranging between 0 and 1 representing the fraction
of traffic generated on ring i toward ring o with respect to the total network load
ρ. Clearly,

$$\sum_i W_{i,o} \le 1 \ \ \forall o, \qquad \sum_o W_{i,o} \le 1 \ \ \forall i$$

In the uniform traffic pattern $W_{i,o} = 1/N_{ring}$ ∀ i,o. For the unbalanced traffic pattern
$W_{i,o}$=0.7 when i=o, and $W_{i,o}$=0.1 otherwise; in other words, the ratio between intra-ring
traffic and the traffic toward each other ring is 7.

Packets are generated at ring nodes according to a Bernoulli distribution whose av- 
erage is derived from the weight matrix described above. 
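
The two weight matrices and a Bernoulli packet source can be sketched as follows. The helper names are hypothetical, and the mapping from the network load and the weight matrix to the per-slot arrival probability of a single node is not specified here, so the sketch simply takes that probability as an input.

```python
import random
from typing import List


def uniform_weights(n_ring: int) -> List[List[float]]:
    """W[i][o] = 1 / n_ring for every ring pair."""
    return [[1.0 / n_ring] * n_ring for _ in range(n_ring)]


def unbalanced_weights(n_ring: int, intra: float = 0.7,
                       inter: float = 0.1) -> List[List[float]]:
    """W[i][o] = intra on the diagonal and inter elsewhere."""
    return [[intra if i == o else inter for o in range(n_ring)]
            for i in range(n_ring)]


def bernoulli_arrivals(prob: float, n_slots: int,
                       rng: random.Random) -> List[bool]:
    """One independent arrival trial per slot, each with probability prob."""
    return [rng.random() < prob for _ in range(n_slots)]
```

With four rings, each row of either matrix sums to 1, so the whole offered load of a ring is split among its four destination rings.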

We first plot the throughput (ratio between used and allocated slots) for each desti- 
nation ring on ring 0; this is a steady-state value obtained using statistically signifi- 
cant measures by simulation. Note that, although we plot the throughput for a single 
ring, the same behaviour holds for all other rings due to ring symmetries. Nodes on 
the same ring do not exhibit throughput unfairness thanks to the MetaRing algorithm. 

In Fig. 5 we report the throughput for each destination ring on ring 0, and the
overall network throughput (black square markers) as a function of offered load under
uniform traffic. Each destination ring is treated fairly and the total network utilisation
is close to 0.95. Note that we significantly overload the network, since ρ ranges from
0.1 to 3, but the algorithm behaves well even under this extreme condition.

Fig. 5. Throughput under uniform traffic

The 5% utilisation loss is small, given the complexity of the system, and it can be 
shown to be mainly due to receiver contentions, which can be analytically evaluated 
with a combinatorial analysis. 

Fig. 6 shows the frame length as a function of time. We start with a uniform
scheduling with a frame equal to $N_{ring}$ slots. As expected, the system converges to a
frame length roughly equal to OW × ρ; once this value is reached, the frame length
changes slowly following traffic fluctuations. The convergence speed is determined
by the value of the parameter α.




Fig. 6. Frame length as a function of time under uniform traffic 

In Fig. 7 we report the throughput for each destination ring on ring 0, and the
overall ring throughput (black square markers) as a function of offered load under
unbalanced traffic. For values of ρ ranging from 0.1 to 1, the throughput is
proportional to the weight matrix defined for the unbalanced traffic scenario. As soon as the
offered load ρ increases to values that create congestion, the scheduling algorithm
treats all ring-to-ring connections fairly according to a max-min like fairness criterion
[10]; the intra-ring throughput decreases steadily until it reaches the same throughput
obtained by inter-ring connections. Also in this scenario each destination ring is
treated fairly and the total network utilisation is close to 0.95.

Fig. 7. Throughput under unbalanced traffic

We examine in Fig. 8 the bandwidth allocation determined by the scheduling
algorithm for ring 0 under unbalanced traffic for ρ=0.6 (similar curves are observed for
other values of ρ). The allocation is sampled at intervals lasting OW, the observation
window. The ideal scheduling algorithm would steadily allocate bandwidth equal to
0.7 for the connection from ring 0 to ring 0, and 0.1 to all other inter-ring
connections. In our experiment, the initial scheduling is matched to a uniform
traffic pattern, which is clearly not optimal for unbalanced traffic. We can observe a
transient behaviour of less than 2000 slot times (roughly 4 observation windows); this
value depends on the choice made for the parameters defined in the measurement
algorithm. Then, the allocation is close to the optimal one, with some small variations
of a few percent around the ideal value; these differences are due to traffic fluctuations, to
which the scheduler tries to adapt the bandwidth allocation, and to inaccuracies in the
traffic measurement process. The choice of the parameters should be optimised to
control these fluctuations under all traffic conditions. We observed that the algorithm
does not exhibit any drift from the optimal values even under heavily loaded
conditions.

Fig. 8. Bandwidth allocation for unbalanced traffic with ρ=0.6



Finally, in Fig. 9 we show the queue occupancy (in packets) in overload (ρ=2.0)
for a given node on ring 0 (all other nodes show similar queue length behaviours),
sampled every 100 slot times. Whereas the queue length for intra-ring traffic saturates
since this connection is overloaded, all other queues show oscillating behaviours,
since each inter-ring connection becomes congested only when the scheduling does
not allocate enough slots to this connection. Remarkably, although the algorithm aims
only at fair bandwidth allocation, the queue occupancy level is fairly well controlled,
at values smaller than 100 packets, a value not far from the ring propagation time, the
time constant below which no bandwidth control can be achieved in this network.

Fig. 9. Queue length for unbalanced traffic with ρ=2.0



6 Conclusions and Future Work

Although the presented simulation results are encouraging and the algorithm shows
good performance, several issues remain to be addressed.

First, other traffic scenarios should be studied to prove the algorithm's robustness to
different environments. Different traffic patterns should be examined, and traffic
generation should be extended from Bernoulli to on-off and/or heavy-tailed traffic
models.

Then, an accurate analysis of the effect of the parameter setting must be provided, 
to obtain a set of values that provides good performance under different conditions. 
Transient behaviours must be carefully analysed to test the ability of the algorithm to 
follow short-term traffic fluctuations. 

Finally, we want to extend the proposal to deal with multiple classes of traffic, to 
provide QoS guarantees similar to those of the DiffServ environment defined by the 
IETF for the Internet.



Abstract. In this paper the Reactive Local Search (RLS) heuristic is
proposed for the problem of minimizing the number of expensive Add- 
Drop multiplexers in a SONET or SDH optical network ring, while re- 
specting the constraints given by the overall number of fibers and the 
number of wavelengths that can carry separate information on a fiber. 
RLS complements local search with a history-sensitive feedback loop that 
tunes the prohibition parameter to the local characteristics of a specific 
instance. 

Simulation results are reported to highlight the improvement in ADM 
cost when RLS is compared with greedy algorithms used in the recent 
literature. 



1 Introduction 

During the last few years, dramatic advances in optical communications have led 
to the creation of large capacity optical WANs, usually in the form of hierarchies 
of rings. As shown in Fig. 1, these rings are joined together via hubs in hierarchies
at various levels. 

In the framework of Wavelength Division Multiplexing (WDM) the overall 
bandwidth of the fiber is divided into densely packed adjacent frequency bands. 
In this manner, every fiber can carry a large number of high-capacity channels. 
Dividing these channels into lower-bandwidth virtual connections between cou- 
ples of nodes gives rise to technical difficulties: wavelengths are at most a few 
hundreds, therefore they need to be time-multiplexed to be shared among many 
couples of communicating nodes. The key components to achieve time multiplexing,
the Add-Drop Multiplexers (ADM), are needed each time a wavelength has
to be processed electronically at a node in order to add or drop packets. Therefore
they represent a significant fraction of the total cost of the infrastructure.

We consider the problem of dividing the available bandwidth of a single ring 
into many channels to establish as many node-to-node connections as requested. 
In the following, the term virtual connection shall be used to refer to an elemen- 
tary node-to-node communication channel. 

Fig. 1. A SONET/SDH hierarchy of rings

In Fig. 2 we describe the structure of a node in the ring. Optical bandwidth
is shared among communication channels in two ways (both of which are
implemented in state-of-the-art optical networks): the incoming signal is initially
split into its component wavelengths by means of a Wavelength Demultiplexer
(WDEMUX), then some wavelengths are directly relayed to the next node, while
some others are further split into packets via a time division multiplexing
mechanism operated by Add-Drop Multiplexers (ADM). ADMs are critical components
(they need to exploit fast packet conversion, header examinations, and must be
tightly synchronised), therefore they are expensive.

In this paper we shall consider unidirectional WDM rings where each wave- 
length can be time-shared among g virtual connections. In other words, the 
bandwidth of a wavelength is assumed to be g (also called grooming factor), 
while single virtual connections, carried by a wavelength in a single time slot, 
have a unit bandwidth. 

The paper is organized as follows. Section 2 describes the Reactive Local
Search heuristic and summarizes the reasons leading to its choice for the WDM
Traffic Grooming problem. Section 3 defines the problem, and summarizes
approaches that have already been used for its solution. Section 4 is devoted to the
application of RLS to the WDM traffic grooming problem, with a description of
the design choices and of the data structures involved in the process. Section 5
analyzes simulation results, giving an experimental comparison among previous
heuristics and the proposed technique.



2 Reactive Local Search 

Let us consider a discrete optimization problem, where the system configuration
x is given by parameters varying in a discrete set C, for example a binary string
($C = \{0,1\}^n$) or an integer vector ($C = \mathbb{N}^n$), and the optimality of a given
configuration is evaluated by a cost function $f : C \to \mathbb{R}$.

Many interesting optimization problems are known to be computationally
intractable, and the WDM minimization problem has been proved NP-hard
[?]. Local Search techniques are a family of heuristics aimed at finding
near-optimal solutions to hard problems, optimizing the value of the cost function by
local modifications of the system configuration.

A simple example is given in Fig. 3, where a repeated local search technique is
applied to the problem of minimizing a function of a binary string. Here “local”
means that the step from a solution to the next is performed by flipping a single
bit of the string. In other words, the neighborhood of a solution is given by all
solutions having Hamming distance 1 from it.

The basic steps of the algorithm are the following. A random configuration
is generated, then local steps are repeatedly performed by evaluating
the objective function on all neighbors and finding the best improvement
(inner loop). If f is bounded from below, this leads to a local minimum of the
cost function¹, and by repeating the whole procedure with new random starting
points (outer loop) the optimum will eventually be found. Once a local minimum
is found the algorithm will update the best configuration and jump
to a completely new starting point.

A substantial improvement to the previous scheme is obtained when one
considers that problem instances are not completely unstructured. Usually, local
minima tend to cluster, and once one is found it is advisable to continue exploring
its vicinity rather than starting immediately from a new random point. The
basic scheme cannot do that, because it is forced to explore only towards improving
solutions. A possible modification of this approach is Simulated Annealing,
which accepts - with a given probability - solutions that lead to increases of the
cost function; notice that Simulated Annealing is a Markovian heuristic, where
the next configuration produced is not influenced by past history, but only by
the current configuration.

¹ Unless a neighborhood with equal cost is reached (a “plateau”): modifications of the
algorithm in Fig. 3 would be necessary in this case

var x, best_x: binary string; n, cost, best_cost: integer;
best_cost ← +∞
repeat
    generate a random solution into x
    cost ← f(x)
    repeat
        let n be one of the bits of x whose flipping most decreases
            the value of f(x)
        if a decrease is possible then
            x[n] ← not x[n]
            cost ← f(x)
    while an improvement is possible
    if cost < best_cost then
        best_cost ← cost
        best_x ← x
until some termination condition is satisfied

Fig. 3. The basic local search algorithm on a binary string

Another branch of local search heuristics consists of non-Markovian techniques,
such as prohibition-based heuristics that forbid some moves according to
criteria that take into account past moves. Prohibition-based schemes date back
to the sixties [8,9] and have been proposed for a growing number of problems
in the eighties with the term Tabu Search (TS) [6] or Steepest Descent-Mildest
Ascent [7]. Fixed TS (see the classification proposed in [1]) implements the Local
Search technique with two major modifications:

1. it does not terminate a run when the local minimum is found and 

2. once a move (for example a bit flipping) is made, the reverse move (i.e. flip- 
ping that bit again) is prohibited for a given number of iterations T (called 
prohibition period). 
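The two modifications can be illustrated with a minimal Python sketch of Fixed TS on a binary string. The function name, the fixed step budget and the convention that a bit flipped at step s may be flipped again only when s < step - T are assumptions made for illustration.

```python
import random


def fixed_tabu_search(f, n_bits, T, max_steps, rng):
    """Fixed Tabu Search on bit strings with prohibition period T."""
    x = [rng.randint(0, 1) for _ in range(n_bits)]
    cost = f(x)
    best_x, best_cost = x[:], cost
    last_flip = [float("-inf")] * n_bits  # step at which each bit was last flipped
    for step in range(1, max_steps + 1):
        move, move_cost = None, float("inf")
        for i in range(n_bits):
            if last_flip[i] >= step - T:
                continue  # reverse move still prohibited
            x[i] ^= 1
            c = f(x)
            x[i] ^= 1
            if c < move_cost:
                move, move_cost = i, c
        if move is None:
            continue  # every move is prohibited at this step
        # The best allowed move is taken even if it worsens the cost,
        # so the run does not stop at a local minimum.
        x[move] ^= 1
        cost = move_cost
        last_flip[move] = step
        if cost < best_cost:
            best_x, best_cost = x[:], cost
    return best_x, best_cost
```

Because the search keeps moving after a local minimum is reached, the best configuration has to be recorded separately, as done with `best_x` here.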

Although various TS schemes have been shown to be effective for many problems,
some of them are complex and contain many possible choices and parameters,
whose appropriate setting is a problem shared by many heuristic techniques.
In some cases the parameters are tuned through a trial-and-error feedback loop 
that includes the user as a crucial learning component: depending on preliminary 
tests, some values are changed and different options are tested until acceptable 
results are obtained. The quality of results is not automatically transferred to 
different instances and the feedback loop can require a lengthy process. 

On the other hand, reactive schemes aim at obtaining algorithms with an
internal feedback (learning) loop, so that the tuning is automated. Reactive
schemes are therefore based on memory: information about past events is collected
and used in the future part of the search algorithm. The TS-based Reactive
Local Search (RLS) adopts a reactive strategy that is appropriate for the
neighborhood structure of the problem: the feedback acts on a single parameter
(the prohibition period) that regulates the search diversification, and an explicit
memory-influenced restart is activated periodically. RLS has been successfully
applied to a growing number of problems, from Maximum Clique [3] to Graph
Partitioning [2].

In this paper, RLS is adapted for the ADM minimization problem in WDM 
traffic grooming. 



3 The WDM Traffic Grooming Problem

Let N be the number of nodes in the ring, and let M be the number of available
wavelengths, computed as the number of fibers times the number of wavelengths
per fiber. The problem is not affected by the values of the two factors, since the
only important number is the overall number of wavelengths.

Every physical link from a node to its neighbor is capable of carrying M
wavelengths; at every node, some of these wavelengths will be simply relayed to
the outgoing link by an internal fiber without electronic conversion, while some
others will be routed through an ADM, which is therefore necessary only if some
of the traffic contained in that wavelength is directed to, or originated from, that
node.

Let g be the grooming factor, i.e. the number of low-bandwidth channels that
can be packed in a single wavelength on a fiber. For example, if the wavelengths
are carrying an OC-48 channel each, a grooming factor g = 16 means that traffic
is quantized into 16 OC-3 time-multiplexed channels, while if g = 4 only 4 OC-12
time-multiplexed channels will be carried.

Another fundamental parameter is the traffic pattern, an N x N matrix T
whose entry $t_{ij}$ is the number of time-multiplexed low-bandwidth unidirectional
virtual connections required from node i to node j. Note that, since the ring is
unidirectional, there is no reason to consider the matrix as symmetric, and that
the diagonal elements must be null: $t_{ii} = 0$.

Let P be the overall number of virtual connections along the ring:
$P = \sum_{i,j} t_{ij}$. A solution to the problem can be given by an N x N matrix W whose
entry $W_{ij}$ is an integer array of $t_{ij}$ elements (thus empty if i = j), one for each
virtual connection from node i to node j. Thus, the wavelength assigned to
the k-th virtual connection from i to j ($1 \le i, j \le N$, $1 \le k \le t_{ij}$) is $w_{ijk}$
($1 \le w_{ijk} \le M$).

The number of ADMs required at node i ($1 \le i \le N$) is the cardinality of
the set of wavelengths assigned to virtual connections that originate from, or go
to, node i:

$$CH_i(W) = \{ w_{ijk} : 1 \le j \le N,\ 1 \le k \le t_{ij} \} \cup \{ w_{jik} : 1 \le j \le N,\ 1 \le k \le t_{ji} \}.$$

Because $CH_i(W)$ is a set, multiple occurrences of the same wavelength are
counted once (indeed, only one ADM is required to deal with them). The overall
number of ADMs needed for a given wavelength assignment W is

$$f(W) = \sum_{i=1}^{N} \# CH_i(W). \tag{1}$$
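As an illustration of the objective function (1), the following Python sketch counts the ADMs of a given assignment. The dictionary representation of W (keys are node couples, values are lists of wavelengths) and the function name `adm_count` are assumptions chosen for brevity, not the data structures of the original implementation.

```python
def adm_count(W, n_nodes):
    """Total number of ADMs for assignment W, following eq. (1).

    W maps a node couple (i, j), with 1-based node numbers, to the list of
    wavelengths assigned to the virtual connections from i to j.
    """
    channels = {i: set() for i in range(1, n_nodes + 1)}
    for (i, j), wavelengths in W.items():
        # A wavelength needs an ADM both where it originates and where it ends.
        channels[i].update(wavelengths)
        channels[j].update(wavelengths)
    # Sets count repeated wavelengths once, as one ADM serves all of them.
    return sum(len(s) for s in channels.values())
```

For example, on a three-node ring, reusing wavelength 1 for the connections 1 to 2 and 2 to 3 saves one ADM at node 2 compared with using three distinct wavelengths.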

Wavelength assignment matrix W is subject to the constraint that no wavelength
should be overloaded in any fiber of the ring. The fiber segment exiting
from node n (and going to node (n mod N) + 1) is traversed by all virtual
connections (i, j) where $i \le n < j$ or $n < j < i$ or $j < i \le n$. The load of wavelength
l on the outgoing fiber of node n is given by the cardinality of the set

$$WL_{nl} = \{ (i, j, k) : (1 \le i \le n < j \le N \lor 1 \le n < j < i \le N \lor 1 \le j < i \le n \le N) \land 1 \le k \le t_{ij} \land w_{ijk} = l \}.$$

The load constraint will assert that

$$\forall n, l \quad (1 \le n \le N \land 1 \le l \le M) \Rightarrow \# WL_{nl} \le g. \tag{2}$$
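The load constraint (2) can be checked with a short Python sketch. The dictionary representation of W and the choice of counting one violation per overloaded (fiber, wavelength) pair are assumptions made for illustration; the text does not specify how violations are counted.

```python
from collections import Counter


def traverses(i, j, n):
    """True if the connection i -> j uses the fiber leaving node n.

    Nodes are numbered from 1 and the ring is unidirectional.
    """
    return (i <= n < j) or (n < j < i) or (j < i <= n)


def load_violations(W, n_nodes, g):
    """Number of (fiber, wavelength) pairs carrying more than g connections."""
    load = Counter()
    for (i, j), wavelengths in W.items():
        for n in range(1, n_nodes + 1):
            if traverses(i, j, n):
                for wavelength in wavelengths:
                    load[(n, wavelength)] += 1
    return sum(1 for value in load.values() if value > g)
```

An assignment is feasible exactly when `load_violations` returns 0.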

The WDM traffic grooming problem can be stated as follows. 

WDM Grooming Problem:

Given integers N > 0, M > 0, g > 0 and the N x N traffic matrix
$T = (t_{ij})$,

Find a wavelength assignment

$W = (w_{ijk})$, $1 \le k \le t_{ij}$, $1 \le w_{ijk} \le M$,

that minimizes the objective function (1) subject to the load constraint (2).

Most papers on the traffic grooming problem propose combinatorial greedy
algorithms [4,5,10]. For example, [5] suggests some techniques for different kinds
of traffic matrices, notably the egress-node case where all traffic is directed to- 
wards a single hub node, the uniform all-to-all case and a more general distance 
dependent traffic. Two types of algorithms are presented. Some algorithms at- 
tempt to maximize the number of nodes requiring just one ADM, then those 
requiring two and so forth. Others try to efficiently pack the wavelengths by 
dividing nodes into groups and assigning to different wavelengths intra-group 
traffic. 

4 Reactive Local Search for the Grooming Problem 

To implement RLS in an efficient way, the problem needs to be formulated in
terms of an integer array optimization. To transform the wavelength assignment
matrix into an integer array we simply associate an array entry to each virtual
connection.

First of all, all indices start from 0, so in the following sections node indices
will vary from 0 to N − 1 and channel numbers from 0 to M − 1. All virtual
connections between nodes are enumerated consecutively: the k-th virtual
connection from node i to node j is assigned an index $p_{ijk}$ such that any
node couple (i, j) is assigned a consecutive group of $t_{ij}$ paths. The
overall number of paths is, of course, $P = \sum_{i,j} t_{ij}$.

The wavelength assignment matrix W is stored into a P-entry integer array
$S = (s_i)$, where the wavelength assigned to the k-th virtual connection from
node i to node j is stored into $s_{p_{ijk}}$.

The objective function (1) and the load constraint (2) are given in Sect. 3.
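One possible consecutive enumeration of the virtual connections is sketched below in Python. The row-major ordering of node couples and the helper names `build_index` and `path_index` are assumptions: any ordering that gives each couple a consecutive group of paths would serve the same purpose.

```python
def build_index(T):
    """Return (offset, P) for a traffic matrix T given as a list of lists.

    offset[i][j] is the array index of the first virtual connection of the
    couple (i, j); P is the overall number of virtual connections.
    Couples are visited in row-major order (an assumption of this sketch).
    """
    n = len(T)
    offset = [[0] * n for _ in range(n)]
    p = 0
    for i in range(n):
        for j in range(n):
            offset[i][j] = p
            p += T[i][j]
    return offset, p


def path_index(offset, i, j, k):
    """Array index of the k-th (0-based) virtual connection from i to j."""
    return offset[i][j] + k
```

With this layout, a configuration is a plain list S of length P, and S[path_index(offset, i, j, k)] holds the wavelength of the corresponding virtual connection.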

To implement a local search heuristic we need to define the neighborhood of 
a given configuration S. The basic move we chose is equivalent to changing the 
wavelength assignment of a given virtual connection, i.e. to changing a single 
entry of the current configuration array S. 

Last, we modified the objective function to take into account load constraint
violations by adding to it a penalty term given by the number of violations
multiplied by a very large constant (larger than NM, i.e. the maximum number
of ADMs in the system). For this reason we need not explicitly check for a
non-violating string: while minimizing the objective function the number of violations
is automatically reduced.
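A minimal Python sketch of such a penalized objective on the array representation follows. The `endpoints` list giving the node couple of each array entry, the constant NM + 1, and the definition of a violation as one overloaded (fiber, wavelength) pair are assumptions made for illustration.

```python
def penalized_cost(S, endpoints, n_nodes, n_wavelengths, g):
    """ADM count plus a large penalty for each overloaded (fiber, wavelength).

    S[p] is the wavelength of virtual connection p; endpoints[p] = (i, j)
    gives its source and destination, with nodes numbered from 0.
    """
    channels = [set() for _ in range(n_nodes)]
    load = {}
    for wavelength, (i, j) in zip(S, endpoints):
        channels[i].add(wavelength)
        channels[j].add(wavelength)
        n = i
        while n != j:  # walk the unidirectional ring from i to j
            load[(n, wavelength)] = load.get((n, wavelength), 0) + 1
            n = (n + 1) % n_nodes
    adms = sum(len(c) for c in channels)
    violations = sum(1 for value in load.values() if value > g)
    penalty = n_nodes * n_wavelengths + 1  # larger than NM
    return adms + penalty * violations
```

Since the penalty exceeds the largest possible ADM count, any feasible configuration costs less than any infeasible one, which is what lets the search ignore feasibility as a separate test.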

We show in Fig. 4 an outline of the Reactive Local Search algorithm used for
the WDM grooming problem. The function accepts three parameters:
the maximum number of moves max_moves, the integer array best_x which is
going to contain the best found solution and the cost function ADMs.

In the first section, variables are declared. The array x contains
the current problem configuration, T contains the current prohibition period
(time after which a given move can be repeated), while the array LastExecuted
is indexed by moves and contains the last iteration a move was performed (−∞
if it was never executed). By move we mean the local step from a configuration
to the next. In our tests a move is represented by a couple of integers (i, n)
meaning that wavelength n is assigned to the i-th virtual connection.

The initialization section generates a random configuration, initializes
the prohibition period T and sets all entries in LastExecuted to minus
infinity (they have never been performed).

The solution improvement cycle is made of two distinct parts. At
first a suitable move is found: the move must be the “best” in the sense that, after
some nonprohibited moves have been checked (as it may be impractical to check
all possible moves, we chose to stop after checking 1000 possible candidates), a
move is chosen that most decreases the ADMs function or, if no decrease is possible,
increases ADMs as little as possible. After the move has been performed,
the configuration string is updated and the new value of the cost function is
computed, compared with the best value and, if it improves on it, stored.

function RLS_for_Grooming (max_moves: integer;
                           by_ref best_x: array of integer;
                           function ADMs (array of integer): integer);
var
    x: array of integer;
    n, cost, best_cost, T, mv, step: integer;
    LastExecuted: array of moves;
Initialization:
    generate a random assignment into x
    best_x ← x;
    cost ← ADMs(x);
    best_cost ← cost;
    T ← 1;
    for each mv
        LastExecuted[mv] ← −∞;
Improvement:
    for step in 1..max_moves do
        find the best move mv such that LastExecuted[mv] < step − T
        modify x[i] according to mv
        cost ← ADMs(x);
        LastExecuted[mv] ← step
        if cost < best_cost then
            best_cost ← cost
            best_x ← x
        if x has been visited too often then
            increase T
        else if T hasn’t been increased for some time
            decrease T

Fig. 4. The RLS algorithm for the ADM minimization problem. The details about the
reactive prohibition scheme determining T are given in the text.
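The move selection step can be sketched in Python as follows. How the candidates are drawn is not stated in the text: uniform random sampling, the cap on sampling attempts and the name `pick_move` are assumptions of this sketch.

```python
import random


def pick_move(x, cost_fn, n_wavelengths, last_executed, step, T, rng,
              max_checked=1000):
    """Best of up to max_checked non-prohibited moves (i, n).

    A move (i, n) assigns wavelength n to the i-th virtual connection.
    It is allowed only if last_executed[(i, n)] < step - T.
    Returns (move, cost), or (None, inf) if no allowed move was found.
    """
    best_move, best_cost = None, float("inf")
    checked = attempts = 0
    # The attempt cap guarantees termination when most moves are prohibited.
    while checked < max_checked and attempts < 10 * max_checked:
        attempts += 1
        i = rng.randrange(len(x))
        n = rng.randrange(n_wavelengths)
        if n == x[i]:
            continue  # not a change of configuration
        if last_executed.get((i, n), float("-inf")) >= step - T:
            continue  # prohibited move
        checked += 1
        old = x[i]
        x[i] = n
        c = cost_fn(x)
        x[i] = old
        if c < best_cost:
            best_move, best_cost = (i, n), c
    return best_move, best_cost
```

The returned move is accepted even when it increases the cost, which is what allows the search to leave a local minimum.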

After every move, a reaction step is performed: a configuration
dictionary is checked for the current configuration. If it has already been visited,
then the prohibition period T is increased by some amount (10% in our tests),
while if no configuration has been repeated for some time the value of T is
reduced (again, by 10%).

The implementation of the Reactive Local Search algorithm was written
in C++. The program operates on the P-entry array, while a 64-bit
(long long int) hash fingerprint of each visited configuration is used to index
a LEDA dictionary structure containing relevant data such as the iteration
number and the number of times that configuration has been visited.

R. Battiti and M. Brunato

The initial prohibition period is 1 (in this case a move cannot be undone in
the next step). If the number of configurations that have been visited more than
three times exceeds 3, the prohibition time is increased by 10% and
rounded to the next integer. If the prohibition time has not been raised for a certain
number of steps, then it is decreased by the same amount (a high prohibition
time facilitates escaping from local minima, but prevents a large fraction of
neighboring configurations from being explored).
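The reaction rule can be sketched in Python as below. Several details are assumptions of this sketch rather than statements of the text: the value of `idle_steps` (the text says only "a certain number of steps"), the reset of the repetition counter after each increase, the rounding down on decrease, and the use of Python's built-in `hash` in place of the 64-bit fingerprint and LEDA dictionary.

```python
class ReactiveProhibition:
    """Feedback rule for the prohibition period T (illustrative sketch)."""

    def __init__(self, idle_steps=100):
        self.T = 1
        self.visits = {}           # fingerprint -> number of visits
        self.often_repeated = 0    # configurations visited more than 3 times
        self.last_change = 0       # step of the last change of T
        self.idle_steps = idle_steps  # assumed value

    def react(self, x, step):
        key = hash(tuple(x))
        count = self.visits.get(key, 0) + 1
        self.visits[key] = count
        if count == 4:  # this configuration has just exceeded three visits
            self.often_repeated += 1
        if self.often_repeated > 3:
            # Increase by 10%, rounded up, using integer arithmetic.
            self.T = (11 * self.T + 9) // 10
            self.last_change = step
            self.often_repeated = 0  # assumption: restart the count
        elif step - self.last_change > self.idle_steps:
            # Decrease by 10%, never below 1.
            self.T = max(1, (9 * self.T) // 10)
            self.last_change = step
        return self.T
```

Integer arithmetic is used for the 10% increase because floating-point rounding of `T * 1.1` can push an exact result such as 11 just above the integer and cause an unintended extra increment.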

5 Simulation Results 

We tested the algorithm on the all-to-all uniform traffic case, where the traffic
requirement is equal to 1 for all couples of nodes:

$$t_{ij} = \begin{cases} 1 & \text{if } i \ne j \\ 0 & \text{if } i = j \end{cases}$$

Figure 5 shows a comparison between the RLS results (best value after 10
runs of $10^6$ steps each) and Modiano’s algorithm for all-to-all uniform traffic with one
channel request per couple of nodes. We considered cases g = 4 and g = 16, as
they are analytically studied in [5]. In both cases RLS results in a considerable
reduction of the number of ADMs in the ring, up to 28% for g = 4 and up to
31% for g = 16.

In Fig. 6 we compare the best solution found after 10 RLS runs (already
reported in Fig. 5) with the average value for the same set of runs of the
algorithm. In order to distinguish the two lines we were forced to restrict the graph
to the higher portion of tested values (12 to 16 nodes) and to a grooming factor
equal to 16. In fact, most runs return the best found value, and only one or two
in the total end with one or two more ADMs.

The RLS algorithm took about 8 minutes of CPU time per run on a 500 MHz
Pentium III computer with 64 MB RAM running the Linux operating system.
The size of the problem (the number of nodes) did not affect the execution time
because we fixed the number of neighboring configurations to check at 1000.

6 Conclusions 

Experimental results obtained by simulations of the all-to-all uniform traffic case 
show that the proposed RLS technique is competitive with other greedy tech- 
niques used in the literature. Of course, local search heuristics need to explore 
a large set of solutions to find good local minima; this is acceptable, because 
circuit planning is an off-line operation, where a computation taking a few min- 
utes is perfectly tolerable, in particular when it cuts down hardware costs in a 
significant way. 



Reactive Search for Traffic Grooming in WDM Networks

[Plot: Number of ADMs versus Number of nodes (8 to 15). Legend: Modiano, g=4; Reactive search, g=4; Modiano, g=16]

Fig. 5. Comparison between RLS and Modiano

[Plot: Number of ADMs versus Number of nodes (12.5 to 15.5). Legend: Average, g=16; Minimum, g=16]

Fig. 6. Minimum, average and confidence interval for RLS, g = 16








References 

1. Battiti, R.: Reactive search: Toward self-tuning heuristics. In V. J. Rayward- 
Smith, editor, Modern Heuristic Search Methods, chapter 4, pages 61-83. John 
Wiley and Sons Ltd, 1996. 

2. Battiti, R., Bertossi, A.: Greedy, prohibition, and reactive heuristics for graph 
partitioning. IEEE Transactions on Computers 48 (1999) 361-385 

3. Battiti, R., Protasi, M.: Reactive local search for the maximum clique problem. 
Algorithmica 29 (2001) 610-637 

4. Berry, R., Modiano, E.: Reducing electronic multiplexing costs in SONET/WDM 
rings with dynamically changing traffic. IEEE Journal on selected areas in com- 
munications 18 (2000) 1961-1971 

5. Chiu, A., Modiano, E.: Traffic grooming algorithms for reducing electronic multi- 
plexing costs in WDM ring networks. IEEE Journal of Lightwave Technology 18 
(2000) 2-12 

6. Glover, F.: Tabu search — part I. ORSA Journal on Computing 1 (1989) 190-206 

7. Hansen, P., Jaumard, B.: Algorithms for the maximum satisfiability problem. 
Computing 44 (1990) 279-303 

8. Kernighan, B., Lin, S.: An efficient heuristic procedure for partitioning graphs.
Bell System Technical Journal 49 (1970) 291-307

9. Steiglitz, K., Weiner, P.: Some improved algorithms for computer solution of the
traveling salesman problem. Proceedings of the Sixth Allerton Conference on
Circuit and System Theory (Urbana, IL, 1968) 814-821

10. Wan, P.-J., Calinescu, G., Liu, L., Frieder, O.: Grooming of arbitrary traffic in 
SONET/WDM BLSRs. IEEE Journal on Selected Areas in Communications, 18 
(2000) 1995-2003 




Micromobility Strategies 
for IP Based Cellular Networks 



Joseph Soma-Reddy and Anthony Acampora 



University of California, San Diego 
9500 Gilman Drive, La Jolla, CA 92093 USA 
soma@cwc.ucsd.edu, acampora@ece.ucsd.edu



Abstract. With the increasing popularity of IP-based services on mobile
devices, future cellular networks will be IP based and designed more as an
extension to the Internet than to the telephone network. Managing
mobile hosts within an IP-based network is a challenge, especially when fast
handoffs are required. We propose a micromobility scheme that achieves
very fast handoffs with no wireless overhead. We compare our protocol
with other proposed schemes.



1 Introduction 

The evolution of cellular networks into the third generation and beyond will 
drastically change the nature of mobile applications and services. While mobile 
telephony has been the dominant application so far, the high data rates available
in the future will cause IP-based services like Internet access, email, etc. to become
more prevalent. The nature of the mobile devices will also change from simple
cellphones to IP-based hosts like PDAs, handheld computers, laptops, etc.

Eventually, the hybrid voice and data 3rd generation networks will evolve into
All-IP cellular networks which will be just an extension to the Internet. Base sta- 
tions would simply be routers with wireless links and will be connected directly 
to the Internet. Telephony will be offered as an application using VoIP (voice over 
IP). Such a network will be truly integrated with the Internet and will enable 
easy creation and deployment of new mobile services. 

One of the main challenges facing the development of such networks is mo- 
bility management. The Internet protocols have been designed for fixed hosts. 
The IP address of a host identifies not a particular host but rather its point of 
attachment. Thus a mobility management protocol is needed to handle mobile 
users and to ensure that they receive service as they move within the All-IP 
cellular network. 

The Mobile IP protocol [1] is the Internet standard for managing mobile hosts.
While it is a scalable protocol, it has been designed for nomadic hosts and its
high latency and overhead make it unsuitable for cellular networks that require
fast and frequent handoffs. Instead, a hierarchical approach is needed with two
levels of mobility: a micromobility protocol that handles fast mobility within a
small region and a macromobility protocol that handles mobility between these
regions.

Various micromobility protocols have been proposed. However, all these protocols
require exchange of signaling packets between the mobile and the cellular
network during a handoff. Since cellular links have high delay, these protocols
suffer from significant handoff latency. When VoIP is used, these protocols
introduce noticeable speech disruption during a handoff.

In this paper, we propose a new micromobility protocol that does not require 
any exchange of signaling packets over the air during a handoff. When the mobile 
moves to a new base station, it is able to acquire and register a new IP address 
without any over the air signaling. This ensures that the handoff can be com- 
pleted quickly and with no overhead on the wireless link. There is no disruption 
of speech during a VoIP call. The protocol does not impose any more infrastruc- 
ture requirements or scalability constraints compared to other proposed schemes. 
We conclude that the protocol compares favorably with other proposed schemes 
and can be used along with a macromobility protocol like Mobile IP to achieve 
fast, scalable handoffs. 

In section 2 we describe the Mobile IP protocol and the hierarchical mobility 
management architecture. Micromobility schemes proposed in the literature are 
described in section 3. We introduce our micromobility protocol in section 4 and 
compare it with other schemes in section 5. 

2 Mobile IP 

The Mobile IP protocol [1] has been standardized by the IETF to provide mobility
to Internet hosts. This protocol allows a mobile host to maintain its permanent
IP address while changing its point of attachment. The operation of the protocol
is illustrated in Fig. 1. The mobile host has a permanent address (called the
Home Address) assigned to it. The mobile is identified by this address and other
hosts that wish to communicate with the mobile use this address. In addition,
the mobile acquires a temporary Foreign Agent address (also called the care-of
address) at its current location. The mobile registers the care-of address with its
Home Agent node, which is located on the mobile’s home network. When the
mobile is at a foreign location, the Home Agent intercepts packets addressed to
the mobile and encapsulates them in new packets addressed to the mobile’s care-of
address. The Foreign Agent receives these packets and, after decapsulation,
delivers them to the mobile. The encapsulation and decapsulation are transparent to
the correspondent host and to the transport and higher layers in the mobile,
which see only the permanent address of the mobile. A slight variation of this
protocol, called the colocated care-of address option, has no Foreign Agent and
the decapsulation function is performed by the mobile.

The Mobile IP protocol is a very scalable protocol. It only requires a Home
Agent and a Foreign Agent (not necessary with the colocated care-of address option)
and the rest of the network infrastructure can be left unchanged. Since the mobile
always has a topologically correct address at its current location, packet routing
is done on the basis of network prefixes, which is a very scalable method. Hence
the protocol can operate well in a network of global scale, like the Internet.
However, when it is applied in situations that demand fast mobility, like cellular
networks, the high overhead and latency of the protocol make it inefficient. At
each handoff, the mobile node has to acquire a care-of address and register it
with the Home Agent (which may be in a distant location). This process can
take a long time (up to one second) and requires the exchange of several packets
between the mobile, foreign network and Home Agent.







Fig. 1. Mobile IP 



Scalability and low latency are two important requirements for a mobility
management protocol for the global cellular network. However, these two
requirements conflict and it is difficult to design a protocol that is both
fast and scalable. Hence, a hierarchical mobility management architecture with
two levels of mobility is used. This is illustrated in Fig. 2. The global cellular
network is divided into domains, each of which spans a small geographic area.
A domain consists of a Gateway router, several intermediate routers and a
collection of base stations, all of which are under the administrative control of a
single authority. A micromobility protocol would handle mobility between base
stations within a single domain and a macromobility protocol would handle
mobility between domains. The micromobility protocol must be able to achieve fast
handoffs but need not be scalable, since it is used only within a domain, which
is of limited size. The macromobility protocol needs to be scalable but not necessarily
fast, since it only handles handoffs between domains, which are not a frequent
occurrence. It needs to be a standard protocol, since it has to operate between
different domains which could be under different administrative authorities.
Mobile IP, being an IETF standard and a scalable protocol, is a natural choice for
this role. Thus we need a micromobility protocol that can execute fast handoffs
in order to achieve global mobility management in a fast, efficient and scalable
manner.




[Figure legend: B: Base Station; I: Intermediate Router]

Fig. 2. Hierarchical Mobility Management



3 Other Micromobility Protocols 

3.1 Description 

Several micromobility protocols have been proposed in the literature [2,3,4]. In [2],
Mobile IP itself has been proposed to be used as a micromobility protocol.
Since the Home Agent will be located in the Gateway router, close to the
mobile (because a domain will consist of a limited geographical area), the latency
due to mobile registration with the Home Agent will be small. Other
optimizations like using link layer triggers to detect a handoff immediately (rather than
waiting for Foreign Agent or Router advertisements) can help reduce handoff
latency. Work is progressing in this direction in the Mobile IP Working Group.

A new micromobility protocol called Cellular IP has also been proposed.
This protocol is different in that the mobile uses the same IP address at different
locations. Instead, the routing tables are modified at each handoff so that packets
addressed to the mobile are routed to its new location. When the mobile moves
to a new base station, it sends a route update packet to the Gateway router.
As the route update packet travels to the Gateway router, the routers along
the way change their routing entry for the mobile in the reverse direction. Thus
symmetric routing paths are created for the mobile and updated at each handoff.

3.2 Performance

A common feature of all the proposed micromobility protocols is the exchange of
packets between the mobile and a node within the fixed network (Home Agent or
Gateway Router) at every handoff. This exchange of packets causes both high
latency and overhead in the mobility protocol. The cellular channel is unreliable
and time varying due to effects like path loss, shadowing and fading. In
order to counter these effects, particularly fading, diversity techniques like
coding, interleaving, etc. are used. The effect of these techniques is to increase the
delay on the cellular link. For example, the cdma2000 air interface specifies a
20 ms interleaver. Since the complete frame must be received before it can be
de-interleaved, the interleaver alone introduces a delay of between 20 and 40 ms.
Various other factors like coding, processing and queuing delays also contribute
to the overall delay. The typical cellular link has been known to suffer a delay
of 60-100 ms. Delays for packet data systems could be much larger (due
to higher queuing delays, ARQ schemes, etc.). GPRS networks have a link delay
of 80-200 ms and Ricochet networks (a cellular packet data network
available in some cities in the US) have link delays of the order of 230-500 ms.

Clearly, a handoff protocol that relies on the exchange of packets over cellular
links would suffer significant latency. For example, Mobile IP (with colocated care-of
address) would require the exchange of 4 packets (two packets to acquire an
IP address and two to register the address with the Home Agent). This alone
would cause a handoff latency of 240-400 ms. Assuming a 20 ms delay for the
radio level handoff and a 10 ms delay on the wired network, the total handoff
latency would be 280-440 ms. Cellular IP, on the other hand, requires exchanging
2 packets (route update packet and acknowledgment). The handoff latency would
be 160-240 ms.
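The arithmetic behind these figures can be reproduced with a small Python sketch. One point is an assumption: the quoted totals are matched when the 10 ms wired delay is counted for two packets (one in each direction between the base station and the fixed node), which the text does not state explicitly. The function and parameter names are illustrative.

```python
def handoff_latency_ms(air_packets, link_delay_ms=(60, 100),
                       radio_handoff_ms=20, wired_delay_ms=10,
                       wired_packets=2):
    """Range (low, high) of the handoff latency in milliseconds.

    air_packets: packets exchanged over the cellular link during the handoff.
    link_delay_ms: (min, max) one-way delay of the cellular link.
    wired_packets: packets crossing the wired network (assumed to be 2).
    """
    low, high = link_delay_ms
    fixed = radio_handoff_ms + wired_packets * wired_delay_ms
    return air_packets * low + fixed, air_packets * high + fixed


mobile_ip = handoff_latency_ms(4)    # four packets over the air
cellular_ip = handoff_latency_ms(2)  # route update and acknowledgment
```

Under these assumptions the sketch gives 280-440 ms for Mobile IP and 160-240 ms for Cellular IP, in agreement with the totals quoted above; a protocol with no over-the-air exchange would be left with only the fixed 40 ms term.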

In a delay tolerant data connection, a handoff latency of a few hundred
milliseconds will lead to either delayed or lost data packets and will not substantially
affect the quality of the connection. However, for a delay intolerant service like a
VoIP call, handoff latency leads to lost packets. Thus a VoIP call made over an
IP-based cellular network using a Mobile IP based mobility protocol would suffer
a speech loss of about 240-400 ms (12-20 consecutive frames at 20 ms a frame) at
each handoff and a speech loss of 160-240 ms (8-12 consecutive frames) at each
handoff if Cellular IP were the mobility protocol. The speech loss could be worse
if header compression techniques are used (and they must be, otherwise the header
overhead would be too much for VoIP), since such techniques usually propagate
packet loss when several consecutive packets are lost. Thus, it is not possible to
execute handoffs without audible glitches during a VoIP call. This problem
will be exacerbated in future cellular networks, which will achieve higher system
capacity through smaller cell sizes, leading to an increased rate of handoffs.

4 Our Micromobility Protocol

4.1 Protocol Description 

The high handoff latency in the proposed schemes is caused largely by the high 
delay on the cellular links. Thus, to achieve low latency handoffs, we eliminate 
any signaling over the cellular link. This not only achieves low latency handoffs, 
but also reduces signaling overhead on the cellular link. 

When the mobile moves to a new base station, it needs to acquire a new IP
address that is topologically consistent with its new location. Since we propose
not to do any signaling over the cellular link during a handoff, the new IP
address must either be assigned to the mobile ahead of time or the mobile must
be somehow capable of determining the new IP address. While it is possible
to pre-assign to the mobile IP addresses at every base station in the domain,
a better alternative would be for the mobile to determine the IP address at
each base station based on some unique radio level parameters that have been
already exchanged as part of the radio level handoff. The radio level identifiers
that can be used to determine the IP address of the mobile will depend on the
particular cellular network that is in operation. Thus, when the mobile moves
to the new base station, it (and the network) can determine its new IP address
without exchanging any IP level packets over the air.




Fig. 3. Our Micromobility Protocol 



For example, consider the domain shown in Fig. 3. In this domain, the temporary
IP address of a mobile connected to a base station is derived as follows:
a mobile identifier $m_i$ of length 16 bits is derived from some unique radio level
identifier of the mobile, say its IMSI (International Mobile Subscriber Identity).
Similarly, a base station identifier $b_i$ of length 8 bits is derived from a radio
level identifier of the base station. The two identifiers are then concatenated and
prefixed with the network id to form an IP address. Thus, mobile $m_i$ when
connected to base station $b_i$ has a temporary IP address $10.b_i.m_i$ (where 10.*.*.* is
the network id of this domain). For instance, mobile 1 when connected to base station 1
has temporary IP address 10.1.0.1 and when connected to base station 2 has
temporary IP address 10.2.0.1. When the mobile moves to a new base station,
the identifiers $m_i$ and $b_i$ are derived from the radio level identifiers exchanged
during the radio level handoff and the new temporary IP address is constructed
by both the mobile and the new base station without exchanging any packets.
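The address construction of this example can be written as a short Python sketch. The function name `temporary_address` is illustrative, and the sketch takes the two identifiers as already derived, since how they are obtained from the radio level identifiers depends on the cellular network in use.

```python
import ipaddress


def temporary_address(base_station_id, mobile_id, network_octet=10):
    """Build the temporary IPv4 address 10.b.m from the two identifiers.

    base_station_id: 8-bit base station identifier b.
    mobile_id: 16-bit mobile identifier m, filling the last two octets.
    """
    if not 0 <= base_station_id < 2 ** 8:
        raise ValueError("base station identifier must fit in 8 bits")
    if not 0 <= mobile_id < 2 ** 16:
        raise ValueError("mobile identifier must fit in 16 bits")
    value = (network_octet << 24) | (base_station_id << 16) | mobile_id
    return ipaddress.IPv4Address(value)
```

Since the mobile and the base station evaluate the same function on identifiers they both already hold, they arrive at the same address independently, which is the property the protocol relies on.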

A link layer trigger is used to inform the network layer (in the mobile and base
station) of the handoff so that there is no delay in handoff detection. Finally,
the change of IP address of the mobile is kept transparent to the higher layers
through encapsulation.

4.2 Protocol Operation 

When a mobile first connects to a base station within a domain, it is assigned 
a public IP address that will be used by the mobile as long as it is within the 
domain. In addition, the mobile will determine its temporary IP address at the 
base station based on the radio level parameters exchanged during the radio 
level handoff. The base station informs the gateway of the mobile’s temporary 
IP address. 

Packets addressed to the public IP address of the mobile are intercepted by 
the Gateway router and encapsulated in a new IP packet that is addressed to 
the temporary IP address of the mobile. Since this is a topologically consistent 
address, the packets will reach the mobile. At the mobile, the packets are de- 
encapsulated and sent to the higher layers which are unaware of the temporary 
address. 

The mobile continuously measures the signal strengths of the pilot transmis- 
sions of nearby base stations. When the mobile decides to handoff to another 
base station (because the received pilot signal strength from it is stronger than 
the current base station), it simply registers with the new base station at the 
radio level. A link layer trigger informs the network layer in the mobile and the 
new base station of the handoff. The mobile and the base station independently 
determine the new temporary IP address of the mobile at that base station with- 
out exchanging any packets. The base station informs the Gateway router of the 
mobile’s new temporary IP address. The Gateway now begins to encapsulate 
packets addressed to the mobile with the new temporary IP address and they 
reach the mobile at the new location. 
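
The Gateway behavior described above amounts to a binding table from public to temporary addresses plus IP-in-IP style encapsulation. The sketch below is only an illustration under that reading; the Packet and Gateway classes, their method names, and the 192.0.2.7 public address are hypothetical and not part of the protocol specification.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    dst: str         # destination IP address
    payload: object  # user data, or an inner Packet when encapsulated


class Gateway:
    """Hypothetical Gateway router keeping one binding per mobile."""

    def __init__(self):
        self.bindings = {}  # public IP -> current temporary IP

    def update_binding(self, public_ip, temporary_ip):
        # Called when a base station reports a mobile's (new) temporary address.
        self.bindings[public_ip] = temporary_ip

    def forward(self, packet):
        # Intercept packets for a known public address and encapsulate them.
        temp = self.bindings.get(packet.dst)
        if temp is None:
            return packet
        return Packet(dst=temp, payload=packet)


def decapsulate(outer):
    # At the mobile: recover the original packet for the higher layers.
    return outer.payload


gw = Gateway()
gw.update_binding("192.0.2.7", "10.1.0.1")   # mobile attached to base station 1
original = Packet(dst="192.0.2.7", payload=b"speech frame")
print(gw.forward(original).dst)              # 10.1.0.1
gw.update_binding("192.0.2.7", "10.2.0.1")   # after handoff to base station 2
print(gw.forward(original).dst)              # 10.2.0.1
```

The higher layers only ever see the inner packet, so the change of temporary address stays invisible to them.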

Since no packets are exchanged over the air, the latency with this mobility 
management protocol is independent of the delay on the cellular link and is 
determined solely by the latency of the radio level handoff and the delay on 
the wired network (between the base station and the Gateway router). Using the 
numbers from section 3, the handoff latency is 40ms. This corresponds to a loss 
of 2 speech frames in a VoIP call, which is unnoticeable by the user. 

5 Comparison 

Latency: With our micromobility protocol, the handoff latency is only 40ms 
while it is about 280-440ms with Mobile IP as the micromobility protocol and 
about 160-240ms with Cellular IP as the micromobility protocol. While this dif- 
ference may not be noticeable in a data connection, a VoIP call would experience 
noticeable speech disruption at every handoff with both Mobile IP and Cellu- 
lar IP protocols, while with our protocol there would be no noticeable speech 
disruption. 
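
The latency figures can be translated into lost speech frames. The short calculation below assumes a 20 ms speech frame, a value inferred from the statement that 40ms corresponds to 2 frames; the function name frames_lost is hypothetical.

```python
import math

FRAME_MS = 20  # assumed speech frame duration (40 ms <-> 2 frames)


def frames_lost(latency_ms):
    """Number of speech frames that fall inside a handoff gap."""
    return math.ceil(latency_ms / FRAME_MS)


# (minimum, maximum) handoff latency in ms, as quoted in the comparison
latencies_ms = {
    "proposed protocol": (40, 40),
    "Cellular IP": (160, 240),
    "Mobile IP": (280, 440),
}

for name, (low, high) in latencies_ms.items():
    print(name, frames_lost(low), frames_lost(high))
# proposed protocol 2 2
# Cellular IP 8 12
# Mobile IP 14 22
```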

Signaling Overhead: In our protocol, there is no over the air signaling and 
thus no wastage of wireless bandwidth. In contrast, with Mobile IP as the mi- 
cromobility protocol, 4 packets are sent over the air at every handoff and with 
Cellular IP as the micromobility protocol, 2 packets are sent over the air at 
every handoff. The amount of wireless bandwidth consumed by these signaling 
messages is not significant at the current rate of handoffs, but with increasing 
rate of handoffs in the future (because of shrinking cell sizes), this could become 
important. 

Scalability: Since the mobile always uses topologically consistent addresses, 
only one routing table entry per base station is required in the intermediate 
routers. This is true for Mobile IP also. However, with Cellular IP as the mi- 
cromobility protocol, the intermediate routers need to have one routing table 
entry per mobile (because the mobile uses the same address at different base 
stations). As the size of the domain increases and the number of mobiles per 
domain increases, our protocol and Mobile IP will be able to scale better. 

Infrastructure Requirements: The Gateway router and the base stations need 
to be modified to operate our protocol. The rest of the cellular infrastructure 
does not need any changes and standard IP equipment can be used. This is also 
true for Mobile IP. However, with Cellular IP as the micromobility protocol, the 
intermediate routers also need to be modified to support the protocol. 

6 Conclusion 

In this paper, we presented a micromobility scheme for an IP based cellular 
network. It achieves fast handoffs with no overhead on the wireless links. We 
compared our protocol with Mobile IP and Cellular IP and found that it compares 
favorably in handoff latency and over-the-air signaling overhead. In conjunction with Mobile IP, it can be used for mobility 
management over a global IP based cellular network. 

Abstract. Satellites are well suited for mobile Internet applications because of 
their capability to enhance coverage and support long-range mobility. Satellites 
are an attractive alternative for providing mobile access to the Internet in 
sparsely populated areas where high bandwidth UMTS cells cannot be 
economically deployed or in impervious regions where deployment of 
terrestrial facilities is not practical. In this paper we analyze various mobile 
Internet applications for both GEO and LEO satellite configurations (Iridium-like 
and Globalstar-like). For the simulations, we use ns-2 (Network Simulator 
2), enhanced to support LEO and GEO satellites and mobile terminals. As part 
of the ns-2 upgrade, we developed a channel propagation model that includes 
shadowing data from surrounding building skylines. We compute via 
simulations the performance of FTP applications when users are mobile, 
traveling along “urban canyons”. The results show that throughput and delay 
performance is strongly affected by skyline shadowing and that shadowing 
degradation can be compensated by satellite diversity, such as provided by 
Globalstar. 



1. Introduction 

The use of satellites for Internet traffic is a very attractive proposition since the wired 
network is highly congested. On the average, an Internet request needs to travel 
through 17 to 20 nodes, and hence may go across multiple bottlenecks. On the other 
hand, with satellites, just one hop is sufficient to connect two very distant sites. 

Several systems, operating or in development, have been planned to support the 
satellite traffic. In particular, it is worth mentioning Spaceway (Hughes) [1], Astrolink 
(Lockheed Martin) [2], Cyberstar (Loral) [3], SES-Astra [4], Eutelsat [5], SkyBridge 
(Alcatel) [6] and Euroskyway (Alenia Spazio) [7]. 

UMTS/IMT2000 (Universal Mobile Telecommunication System/International 
Mobile Telecommunications 2000) aims to be the platform for supporting new 
multimedia services. In this new system, the satellite component is supposed to be 
integrated with the terrestrial component [8]. Satellites thus assume a particularly 
important role, not merely a complementary one, especially when aiming at true global 
coverage and at ensuring access for maritime, aeronautical and remote users. 

The main goal of this paper is to evaluate the performance of typical mobile 
Internet applications, using different TCP schemes. We analyze some representative 
satellite scenarios, two existing LEO systems (Iridium [9] and Globalstar [10]) and 
the classical GEO configuration. We develop a channel propagation model based on 
shadowing from surrounding building skylines. The model parameters are based on 
actual data in a built-up area. 

The paper is organized as follows. Section 2 reviews the role of satellites. In 
Section 3, we highlight the characteristics of satellites likely to have an effect on TCP 
and other Internet applications. Section 4 introduces the “urban canyon” model for the 
study of the satellite channel in urban environments. Section 5 presents the simulation 
platform (NS-2) used in our experiments and describes the extensions required for 
LEO satellite support. In Section 6, we describe the experiments and present the 
experimental results in Section 7. Section 8 presents the conclusions. 



2. The Role of Satellites 

In the future, satellites will play an important role in providing wireless access to 
mobile users, for enhancing coverage of existing or developing land mobile systems 
and to ensure long-range mobility and large bandwidth. The presence of satellite links 
provides extra capacity and an alternate and less congested route to existing wired and 
wireless systems, thus offering unique opportunities for improving efficiency and 
reliability. The broadcast nature of the satellite channel also makes satellites suitable 
for broadcasting/multicasting. 

The satellite may also be integrated with short-range wireless networks (Bluetooth) 
in order to provide Internet services to mobile vehicles. In this scenario, the satellite 
will connect one vehicular terminal to the Internet while the Bluetooth system will be 
able to interconnect equipment inside the vehicle (car, bus, train, plane or ship). 

The above-mentioned systems [1-7], providing large bandwidth directly to users, 
are based on a geosynchronous orbital configuration. A single GEO satellite is able to 
cover a very wide area with multiple narrow spot beams using multibeam antennas. In 
this way, very reliable links can be provided even with small terminals. Such GEO 
systems provide (or are designed to provide) high availability high data rate links (up 
to 2 Mbit/s and more) and huge total capacity (of the order of 10 Gbit/s). Many of 
them are expected to provide regional limited coverage in a few years and global 
coverage (excluding poles) finally. 

As an alternative to GEO satellites, LEO constellations may also be considered. 
These have low latency and thus are more suitable for real time applications but they 
may need more than one hop to reach remote destinations. They may provide global 
or limited coverage depending on the inclination of orbits. An important issue with 
LEO satellites is that the satellites move and, therefore, connections have to be handed off. 
The only available LEO system (Globalstar [10]) provides low bit rate voice and data 
services; no broadband system is foreseen for the next few years. One LEO system, 
called Teledesic, is being developed for the provision of broadband services [11]. 

Previous work [12-16] has addressed the performance of TCP/IP over 
geostationary satellites. This has mainly focused on evaluating performance of 
different TCP schemes (Tahoe, Reno, New Reno, SACK), where the connection 
between two fixed stations goes across a GEO satellite. 

In this paper, we also address other TCP schemes (Westwood [16]) and evaluate 
TCP performance when terminals are mobile. Such an environment creates a very 
realistic scenario for the evaluation of performance in a satellite environment. 



3. Satellite Features Impacting Performance 
of Internet Applications 

First, when considering the performance of real time (e.g. voice and video) 
applications the large propagation delays represent a critical issue. The problem is 
particularly acute in TCP applications over links with large bandwidth-delay products, 
where a large TCP window is required for efficient use of the satellite link. A lossy 
satellite link can cause frequent “slow start” events, with significant impact on 
throughput. In the case of LEO networks, a delay variation is introduced due to the 
fast movement of satellites, which causes frequent handovers. The key satellite 
network features that need to be considered in order to evaluate the impact of satellite 
on internet application performance are: propagation delay, Delay-Bandwidth Product 
(DBP), frequent handover, signal-to-noise ratio (SNR), satellite diversity, routing 
strategy. 

Satellite systems are changing their role from the “bent-pipe” (transparent) channel 
paradigm to the on-board routing (regenerative) paradigm associated with packet 
transmission and switching. In this process, which involves both GEO satellites and 
non-GEO constellations, each satellite is an element of a network, and new design 
problems are involved both at the network and the transport layers. 

While considering the performance of standard Internet protocols over paths 
including satellite links, the large propagation delays have to be taken into account. 
The problem is particularly relevant to TCP due to its delay-sensitive nature. For 
voice and other real time applications too, the delay may be critical. The delay 
problem can be further accentuated in the case of LEO networks, where the delay may 
also be very variable and the satellite or gateway handover may occur frequently. 

Some of the main aspects that need to be considered in order to evaluate the impact 
of system architecture on the performance of network protocols are: 



Propagation Delay 

Due to the large distance that packets need to travel in satellite networks, they can 
experience a significant transmission delay (depending on the type of constellation 
and the routing strategy adopted). In addition, this delay may also show a lot of 
variability when a LEO architecture is used. 

The transmission delay in a GEO architecture depends on the user-gateway 
distance (or user-user if direct connection is allowed) when a single satellite 
configuration with no ISLs (inter-satellite links) is considered, or on the connection 
strategy if ISLs are used. This delay can be considered constant in the case of fixed 
users while it is variable in the case of mobile users. 

In the case of LEO constellations, the delay variability is much greater than in the 
GEO case. In fact, the time variant geometry of the constellation induces a fast 
variability of the user-satellite-gateway distance both for fixed and for mobile users. 
This phenomenon is even more critical if ISLs are implemented because the routing 
strategy may also play a very important role. In [20], a simplified analysis, 
considering the delay as constant, is presented. 

The delay variability is particularly relevant to TCP, since it causes the round-trip- 
time to be variable and may affect the estimate of Round Trip Time (RTT) by TCP. 
Whenever a change in RTT occurs, it takes some time for TCP to adapt to the change. 
If the RTT keeps changing (as can happen in case of LEO constellations), TCP may 
not be able to update its estimate of RTT quickly enough. This may cause premature 
timeouts/retransmissions, reducing the overall bandwidth efficiency. 
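
To illustrate why a sudden RTT change can trigger a premature timeout, the sketch below implements the standard smoothed RTT estimator (exponentially weighted averages with gains 1/8 and 1/4, timeout = SRTT + 4·RTTVAR). It is a simplified illustration: the minimum timeout and clock granularity used by real TCP stacks are deliberately omitted, and the 20 ms and 60 ms RTT values are assumed numbers, not measurements from this paper.

```python
class RttEstimator:
    """Simplified smoothed RTT / retransmission timeout estimator."""

    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variation

    def __init__(self):
        self.srtt = None
        self.rttvar = None

    def update(self, sample):
        if self.srtt is None:
            # First measurement initializes both estimates.
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - sample)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample

    @property
    def rto(self):
        # No minimum RTO is applied here (an intentional simplification).
        return self.srtt + 4 * self.rttvar


est = RttEstimator()
for _ in range(50):
    est.update(0.020)       # stable 20 ms path (assumed value)

new_rtt = 0.060             # path suddenly lengthens, e.g. after a handover
print(new_rtt > est.rto)    # True: the timer expires before the ACK arrives
```

After a long stable period the variation term shrinks toward zero, so the timeout sits just above the old RTT and any abrupt increase looks like a loss.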



Delay-Bandwidth Product (DBP) 

The performance of TCP over satellite links is also influenced by the Delay- 
Bandwidth-Product (DBP). This has been evaluated through simulations and 
experiments [13, 14, 15]. This performance can be enhanced by adopting techniques 
such as selective acknowledge and window dimensioning [13][15]. Also, techniques 
such as TCP spoofing and cascading may be used [15]. 
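
As a worked example of the delay-bandwidth product, the calculation below uses the 2.048 Mbit/s GEO link capacity from the experiments and an assumed round trip time of 0.5 s, a typical order of magnitude for a GEO path rather than a value taken from this paper. It compares the result with the 65535-byte limit of the 16-bit TCP window field when window scaling is not used.

```python
def delay_bandwidth_product_bytes(capacity_bit_s, rtt_s):
    """Amount of data that must be in flight to keep the link full."""
    return capacity_bit_s * rtt_s / 8


MAX_UNSCALED_WINDOW = 65535  # bytes, limit of the 16-bit TCP window field

dbp = delay_bandwidth_product_bytes(2.048e6, 0.5)  # 0.5 s RTT is an assumption
print(dbp)                        # 128000.0
print(dbp > MAX_UNSCALED_WINDOW)  # True: an unscaled window cannot fill the link
```

Under these assumptions the required window is roughly twice what an unscaled TCP window allows, which is why window dimensioning matters on such links.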



Frequent Handover 

In a connection-oriented service with LEO satellites, each time a handoff occurs, a 
sizeable number of TCP packets can get lost, especially if the signaling exchange is 
not performed carefully. Also, after the handoff is complete (and packets are lost), 
TCP has to scale down its congestion window to 1. 



Satellite Diversity 

Satellite diversity is a very efficient technique for improving link availability or SNR 
by utilizing more than one user-satellite link to establish communication. In a 
connectionless system, the satellite diversity feature has the advantage of increasing 
the probability of packet delivery to/from the satellite network from/to Earth. 
However, this increases the number of busy channels, which causes a reduction in 
capacity. In the experiments in this paper, we see how the satellite diversity in 
Globalstar helps to improve performance over Iridium. 



Signal-to-Noise Ratio (SNR) 

In GEO constellations, the SNR (or equivalently Bit Error Rate, BER) is 
characterized by a great variability due to free space losses and tropospheric 
propagation (including rain, over 10 GHz) for fixed communications and also due to 
shadowing in case of mobile communications. Both shadowing and deep rain fading 
can cause packet loss. 

In LEO systems, the BER variability may be caused, in addition to the previously 
mentioned phenomena, by the radio link dynamics due to the continuously changing 
link geometry (antenna gain and free space losses). 

Poor SNR is an extensively studied issue for GEO satellites. Since TCP has been 
designed mainly for wired links, it assumes that the BER is very low (of the order of 
10 '°, or less) so that when a packet is lost, TCP attributes it to network congestion. 
This causes TCP to employ its congestion control algorithms, which reduces the 
overall throughput. This is detrimental in satellite links, and causes an unnecessary 
reduction in throughput. The effects due to poor SNR extend to LEO and hybrid 
LEO/GEO networks. 



4. Mobile Satellite Channel Model 

It is well accepted that signal shadowing is the dominant critical issue influencing 
land mobile satellite (LMS) systems availability and performance. While multipath 
fading can be compensated through fade margins and advanced transmission 
techniques, blockage effects are not easy to mitigate, resulting in high bit error rates 
and temporary unavailability. The solution to reduce such shadowing effects is path or 
satellite diversity [19]. 

Shadowing is mainly due to blockage and diffraction effects of buildings in urban 
areas and mountains or hills in rural areas, although absorption losses through 
vegetation may also significantly impair the radio channel in a number of 
environments including urban. Non-GEO orbit satellite networks have dynamic, yet 
deterministic topologies and the use of multiple visible satellites and satellite diversity 
techniques is needed to improve link availability. 

Constellations for positioning satellite systems are designed so that even if 
some of the satellites are shadowed, the number of satellites is sufficient to get the 
data needed for position calculation. In the case of communication satellite systems 
too, constellations must take into account the shadowing occurrence. Also, since 
voice and data applications are affected by shadowing differently, the impact of 
satellite diversity needs to be studied separately for them. 

Models based on physical or physical-statistical approaches make use of detailed 
information of the actual environment or a synthetic environment. Consequently, 
these models are appropriate to study the use of satellite diversity since geometry 
information is inherently included in such models. This is not true for other modeling 
approaches like purely statistical or empirical models. 

In this paper we are using a physical-statistical land mobile satellite channel model. 
The model proposed here is based on computing the geometrical projection of 
buildings surrounding the mobile, described through the statistical distributions of 
their height and width [19]. The existence or absence of the direct ray defines the line- 
of-sight (LOS) or shadowed state, respectively. The model can be divided into two 
parts: 

Deterministic or Statistical Parameterization of Urban Environment 

The physical-statistical approach used here proposes a canonical geometry for the 
environment traversed by a mobile receiver, typically a street as shown in Figure 1. 
The canyon street, composed of buildings on both sides, may block the satellite link 
along the mobile route depending on the satellite elevation. 




Fig. 1. Shadowed and line-of-sight satellite links. Buildings can be obtained either from a 
BDB or by generating a synthetic environment (h_b: building height, h_m: mobile height). 

In the case of deterministic characterization of the urban environment, a Building 
Data Base (BDB) is used to obtain the canyon street data. Then a receiver is placed at 
a given position (right or left lane or sidewalk) and the skyline (building profile in the 
sky direction) as seen by the receiver terminal is computed. For fixed users, the 
skyline obtained is fixed; only the satellites will be moving according to the 
constellation dynamics. In the case of a user moving along a given street, the skyline 
seen by the mobile as well as the satellites will be time varying. Note that for satellite 
systems using Gateways (GW), the signal goes from transmitter to receiver through 
the GW. However, the GW links must be considered free of shadowing effects due to 
the environment since the GW antennas are sufficiently elevated and directive and 
keep tracking the satellites in view. 

In order to also address a statistical approach, which means computing synthetic 
canyon streets, we investigated urban canyon street geometry to parameterize real 
street canyons. A statistical approach is clearly of interest for obtaining general results, 
provided that real data are used to generate the canyon streets. In addition, statistical 
approaches are generally less time consuming, and BDBs are not always available. 

Building heights in urban environments from two European countries were studied and 
statistically parameterized enabling the generation of realistic streets. High and 
medium built-up density areas from England and Spain were investigated. Heights of 
buildings in London and Guildford were found to be log-normally distributed while 
three different sectors of Madrid were found to adhere to a truncated Normal 
distribution. These results are summarized in Table 1. 
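
Generating a synthetic street from such a fit can be sketched as below for the truncated Normal case. The default mean and standard deviation are the Madrid-Castellana values from Table 1; the lower truncation bound of 3 m, the seed, and the function name sample_building_heights are assumptions made for this illustration, since the truncation limits are not stated here.

```python
import random


def sample_building_heights(n, mean_m=21.5, std_m=8.9, min_height_m=3.0, seed=0):
    """Draw n heights from a Normal distribution truncated below at min_height_m."""
    rng = random.Random(seed)
    heights = []
    while len(heights) < n:
        h = rng.gauss(mean_m, std_m)
        if h >= min_height_m:  # rejection sampling implements the truncation
            heights.append(h)
    return heights


street = sample_building_heights(1000)
print(len(street), min(street) >= 3.0)  # 1000 True
```

Crossroads could be represented in the same list by setting selected entries to zero height, as the canyon model does.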

For the simulations performed in this work, we used the Madrid Street that has an 
average masking angle of 30°. 



Table 1. Fitted distributions of building heights. 



Country   Location             General Description         Building Heights 
England   London               Densely built-up district   Log-normal, μ = 17.6 m, σ = 0.31 m 
England   Guildford            Medium-size town            Log-normal, μ = 7.2 m, σ = 0.26 m 
Spain     Madrid, Castellana   Central business district   Normal, μ = 21.5 m, σ = 8.9 m 
Spain     Madrid, Chamberi     Residential area            Normal, μ = 12.6 m, σ = 3.8 m 



Calculation of the Elevation Angle to the Skyline (Masking Angles) 

Once the canyon shaped street is available, either by extracting data from BDBs or by 
computing it, the elevation angle to the skyline can be computed. At the user position, 
a scan of 360 degrees is performed to compute the elevation angle to the skyline, i.e., 
the masking angle is computed for every azimuth angle around the user terminal. 
Figure 2 shows an example of the skyline surrounding the user. Buildings are 
synthetic and are generated with parameters corresponding to Madrid-Castellana. 
Figure 3 shows an example of computed masking angles for four streets in Madrid. 

To determine link conditions, we use these computed elevation masking angles for 
different values of the azimuth angle under which the satellite is seen. If the satellite 
elevation angle is larger than the masking angle, it is assumed to be in line-of-sight, 
otherwise it is blocked. The procedure is repeated for all satellites in view. In case 
more than one visible satellite is found, the one with the highest elevation angle is 
chosen, and the packet is made available to this satellite. If no satellites are found, the 
packet is dropped. 
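
The link decision just described can be sketched as follows. The tuple layout for a satellite and the representation of the skyline as a function of azimuth are assumptions made for this illustration; select_satellite is a hypothetical name.

```python
def select_satellite(satellites, masking_angle_deg):
    """Pick the line-of-sight satellite with the highest elevation.

    satellites: iterable of (name, azimuth_deg, elevation_deg) tuples.
    masking_angle_deg: callable mapping an azimuth to the skyline elevation.
    Returns the chosen tuple, or None when every satellite is blocked
    (in which case the packet would be dropped).
    """
    in_los = [s for s in satellites if s[2] > masking_angle_deg(s[1])]
    if not in_los:
        return None
    return max(in_los, key=lambda s: s[2])


def flat_skyline(azimuth_deg):
    return 30.0  # constant 30 degree masking angle, for illustration


sats = [("A", 10.0, 25.0), ("B", 120.0, 42.0), ("C", 250.0, 55.0)]
print(select_satellite(sats, flat_skyline))      # ('C', 250.0, 55.0)
print(select_satellite(sats[:1], flat_skyline))  # None
```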

Our canyon street geometry includes modeling of crossroads by setting buildings 
to zero height. What is not considered in the canyon shaped street model is the 
eventual presence of second-row buildings, but their effect can be considered 
negligible for the purposes of this paper. 

Fig. 2. Example (1000 samples) of skyline generated with parameters of London 






Fig. 3. Masking angles computed for 4 streets of Madrid. 



On the other hand, the time the mobile needs to move along a given canyon street 
may be too short (depending on the mobile speed) to obtain statistically significant 
TCP simulation times. In order to obtain long simulation runs, very long canyon 
streets were generated to simulate the statistical variation of heights and the 
occurrence of crossroads with the movement. In addition, longer routes are also easily 
and realistically obtained by changing the azimuth reference of the masking angle 
series. 

5. Simulation Platform and Its Extensions 

The simulations have been performed using ns-2 (Network Simulator) [17], enhanced 
to provide better support for satellite environments. The following were the key 
enhancements added to ns-2: 

Terminal Mobility and Shadowing Channel - A shadowing channel was added to 
simulate the behavior of a mobile terminal in an urban environment. The channel was 
derived from the skyline of a street in London. The terminal is considered shadowed 
if the elevation angle of the satellite is less than the elevation of the skyline, as 
explained in Section 4. The channel has an ON-OFF behavior and the link is assumed 
to be down when the terminal is shadowed. Also, mobility was added to the terminal 
by moving it up and down the street. The skyline seen by the terminal changes as it 
moves and this combined with the current position of the satellite network determines 
the shadowing state of the terminal. 

Gateway - The concept of a “Gateway” node was added to the simulator. This node 
was introduced to model the “Gateways” present in satellite networks like Iridium and 
Globalstar. The Gateway can be used as an interface between the satellite 
constellation and a terrestrial network and this feature can be used to model hybrid 
satellite-terrestrial networks. An important feature of the Gateway node is that it 
maintains links to all satellites that are visible to it. Also, these links typically belong 
to different orbits in a non-polar constellation. This “multiple links” property is used 
to enhance the Globalstar constellation to a “full constellation” in which inter-orbit 
switches are performed by going through the Gateway node. 

Diversity - Since the Gateway node can maintain links to more than one satellite, it 
supports “selection diversity” of satellites. This means it periodically (every 0.4 s) 
evaluates the elevation angle of all the visible satellites to select the best and forwards 
packets to it. 

Mobility Modeling and Handoff - In the simulations, mobility is modeled by moving 
the terminal continuously up and down the street over a straight path of about 10 km. 
The position of the terminal at any time is determined by its speed and the time 
elapsed and the skyline seen by the terminal at that position gives the minimum 
elevation angle below which a satellite is shadowed. While modeling handoffs, we 
assume that the handoff execution time is negligible. The handoff procedure is 
invoked every 0.4 s and a handoff takes place when either the current satellite goes 
below the horizon or the current satellite is shadowed by the skyline. While 
performing the handoff, we look for the “unshadowed” satellite with the highest 
elevation angle. Note that the skyline gives us the minimum elevation angle above 
which a satellite is visible for a certain value of azimuth of the satellite. Thus, the 
azimuth of a satellite together with the information provided by the skyline 
determines whether a satellite is shadowed or not. 
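
The handoff rule can be sketched as a function that is evaluated once per 0.4 s period. The sketch is not the ns-2 extension itself; the data layout and the names is_usable and handoff are hypothetical, and handoff execution time is taken as zero, as assumed above.

```python
HANDOFF_PERIOD_S = 0.4  # interval at which the handoff procedure is invoked


def is_usable(sat, skyline):
    """A satellite is usable if it is above the horizon and above the skyline."""
    _name, azimuth_deg, elevation_deg = sat
    return elevation_deg > 0.0 and elevation_deg > skyline(azimuth_deg)


def handoff(current_name, satellites, skyline):
    """Return the name of the satellite to use for the next period.

    The current satellite is kept while it remains usable; otherwise the
    unshadowed satellite with the highest elevation is chosen. None means
    that no satellite is available and the link is down.
    """
    by_name = {s[0]: s for s in satellites}
    current = by_name.get(current_name)
    if current is not None and is_usable(current, skyline):
        return current_name
    candidates = [s for s in satellites if is_usable(s, skyline)]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s[2])[0]


def skyline(azimuth_deg):
    return 30.0  # constant masking angle, for illustration only


sats = [("A", 10.0, 35.0), ("B", 120.0, 60.0)]
print(handoff("A", sats, skyline))                     # A (still usable, no handoff)
sats = [("A", 10.0, 20.0), ("B", 120.0, 60.0)]
print(handoff("A", sats, skyline))                     # B (A is now shadowed)
print(handoff("A", [("A", 10.0, -5.0)], skyline))      # None (below the horizon)
```

Note that the current satellite is retained even when a higher one is visible, since a handoff is triggered only by loss of the current link.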

6. Description of the Experiments 

To measure TCP performance for LEO satellite networks (both Iridium and 
Globalstar constellations) we ran a set of 24 half-hour FTP sessions in various 
topologies. For all the experiments, we ran FTP transfers using different TCP versions 
(Reno, SACK, and Westwood). 

In the first topology, both the satellite gateway and a terminal were co-located in 
Europe (lat: 47°N, long: 7°E), and connected with a single LEO satellite (the LEO 
constellation has no ISLs). In this topology, we measured TCP throughput using a 
range of uniformly distributed packet error rates between 10% and 40%. We also 
performed one experiment for a low packet error rate of 10'“* in order to compare with 
higher packet error rates. In addition to the packet error rate, we varied link capacity 
between 16 Kbit/s and 144 Kbit/s. The results for this topology are presented in 
Figures 4 and 5. 

In the second topology, the terminal was located as in the earlier topology, but the 
gateway was relocated to Los Angeles. In this case, the LEO constellation also has 
inter-satellite links. Again, we performed TCP throughput experiments using packet 
error rates between 10% and 40%. The link capacity was also varied as earlier. Figure 
6 shows the results for this topology. 

The third topology used a GEO satellite to connect two terminals, one located in 
Rome and the other in Washington D.C. respectively, with the satellite positioned 
over the Atlantic Ocean. TCP throughput experiments were performed using packet 
error rates between 10'^ and 10'* Link capacities were 1.024 Mbit/s and 2.048 Mbit/s. 
In this case, only six half-hour FTP sessions were used for the GEO simulations. The 
LEO case required more FTP sessions in order to appropriately capture changes due 
to the time-varying constellation topology. The results for this experiment are shown 
in Figure 7. 

In the fourth set of experiments, the two terminals are again located at Rome and 
Los Angeles. These are connected using two GEO satellites, connected by an ISL. 
The parameters (link capacity, packet error rate) are the same as in the previous 
experiment; the results are shown in Figure 8. 

Finally, we perform shadowing and mobility experiments both for GEO and LEO 
configuration. We use the mobile satellite channel model explained in Section 4. In 
the first shadowing experiment, a mobile terminal, located in Madrid and a Gateway, 
located in Los Angeles are connected using a LEO satellite. The terminal and the 
Gateway run an FTP transfer between them. We do this experiment for the 3 TCP 
versions (Reno, SACK and Westwood) for both Iridium and Globalstar. The link 
capacity is 2.048 Mbit/s and the terminal speed is varied. The results are shown in 
Figure 9. 



7. Simulation Results 

Figure 4 shows performance of FTP as a function of packet error rate for single hop 
LEO configuration. In this case performance is acceptable up to an error rate of 10% 
and there is no significant difference in performance among different TCP schemes; 
the choice of constellation (Iridium or Globalstar) also does not cause much 
difference in performance. 

Fig. 4. Performance of FTP as a function of packet error rate for single hop LEO configuration 





Fig. 5. Performance of FTP as a function of capacity for single hop LEO configuration 

Figure 5 shows performance of FTP as a function of capacity for a single hop 
LEO configuration. The graph shows the normalized throughput (the utilization) as a 
function of link capacity. 

In Figure 6, we show performance of FTP as a function of packet error rate for a 
full LEO configuration. 

Figure 7 shows performance of FTP as a function of capacity and packet error rate 
for a single hop GEO configuration, with users located in Rome and Washington D.C. 
It can be seen that the faster recovery algorithm employed by TCP Westwood, which 
sets the congestion window equal to the estimated link capacity * RTT, shows a better 
performance compared to TCP Reno. It may be noted that TCP Reno halves its 
window each time a packet is lost due to corruption. This new window size will be 
less than the estimated window size by TCP Westwood except when the pipe is full 
(in which case both are equal).

Fig. 6. Performance of FTP as a function of packet error rate for full LEO configuration

Figure 8 shows the results for a configuration containing two GEO satellites
(connected by an ISL), with the users located in Rome and Los Angeles respectively.
TCP Westwood shows better performance in this configuration too. 

All TCP experiments demonstrate a high sensitivity of TCP performance to 
random errors. The impact of errors is particularly egregious for the GEO case as a 
result of increased Bandwidth Delay Product (BDP). TCP Westwood performs 
significantly better in the presence of such random errors. 
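
To make the role of the bandwidth-delay product concrete, the following sketch compares the window each scheme would adopt after a single corruption loss. It is a simplified illustration, not the simulator used for the figures: the function names, the 1000-byte segment size, the 0.5 s GEO round-trip time and the starting window are all assumptions, and the Westwood bandwidth estimate is taken to be perfect.

```python
def bdp_segments(capacity_bps, rtt_s, segment_bytes=1000):
    """Bandwidth-delay product expressed in segments (assumed segment size)."""
    return capacity_bps * rtt_s / (8 * segment_bytes)


def reno_window_after_loss(cwnd):
    """TCP Reno halves its congestion window when it detects a loss."""
    return max(cwnd / 2.0, 1.0)


def westwood_window_after_loss(est_bandwidth_bps, rtt_s, segment_bytes=1000):
    """TCP Westwood sets the window to (estimated bandwidth) * RTT."""
    return max(bdp_segments(est_bandwidth_bps, rtt_s, segment_bytes), 1.0)


capacity = 2.048e6   # link capacity used in the experiments (bit/s)
geo_rtt = 0.5        # assumed round-trip time for a GEO hop (s)
cwnd = 40.0          # hypothetical window, well below the pipe size

pipe = bdp_segments(capacity, geo_rtt)
reno = reno_window_after_loss(cwnd)
westwood = westwood_window_after_loss(capacity, geo_rtt)
print(f"pipe={pipe:.0f} segments, Reno -> {reno:.0f}, Westwood -> {westwood:.0f}")
```

With a large pipe, a window that has been halved by a non-congestion loss takes many round trips to grow back, which is one way to read the GEO results above.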




Fig. 7. Performance of FTP as a function of capacity and packet error rate for single hop GEO
configuration

Fig. 8. Performance of FTP as a function of capacity and PER for double GEO (with ISL)
configuration



Figure 9 shows that, when using a LEO constellation, there is not much difference 
between the TCP schemes, but Globalstar outperforms Iridium due to its “diversity 
capability”. In the Globalstar constellation, most areas are covered by more than one 
satellite at any point. Iridium, on the other hand, provides only one satellite to cover 
an area. When one satellite is shadowed, Globalstar is able to select an alternate 
satellite to establish a connection. This enables it to provide better service compared to
Iridium in the presence of shadowing. The performance of TCP Westwood is also 
seen to be slightly better than that of the other TCP versions. 




Fig. 9. Performance of FTP vs. terminal speed for a full LEO configuration in a shadowed 
environment



8. Conclusions

The access to the Internet in the presence of wide range mobility represents one of the 
key issues for future telecommunication systems. In such a scenario, the satellite 
assumes a very important role. 

In this paper, we investigated the performance of various TCP schemes in satellite 
environments characterized by variable propagation conditions. We evaluated 
different architectures (LEO, GEO, single hop, full) in representative fixed/mobile 
terminal scenarios. 

The simulations show that TCP Westwood is able to outperform TCP Reno and 
SACK in the presence of random errors or shadowing. The faster recovery algorithm 
used by TCP Westwood helps it to recover quickly from packet error. Also, the 
satellite diversity provided by Globalstar may be used to provide better connectivity 
in a shadowed environment.



Abstract. We present a study on the performance of TCP, in terms of
both throughput and energy consumption, by considering a Wideband
CDMA radio interface. The results show that the relationship between
throughput and average error rate is largely independent of the network
load, making it possible to introduce a universal throughput curve, which
gives throughput predictions for each value of the user error probability.
Furthermore, the possibility of selecting an optimal power control threshold
to optimize the tradeoff between throughput and energy is discussed.



1 Introduction 

Based on the evolutionary trend of the telecommunications market, a significant 
step has been taken by the wireless industry, by developing a new generation 
of systems whose capabilities are intended to considerably exceed the very lim- 
ited data rates and packet handling mechanisms provided by current second 
generation systems. 

The goal of third generation systems, namely to provide multimedia services to
wireless terminals and to offer transport capabilities based on an IP backbone
and on the Internet paradigm, calls for the extension of widespread Internet
protocols to the wireless domain.

In particular, extension of the Transmission Control Protocol (TCP) has
received considerable attention in recent years, and many studies published
in the open literature address the possible performance problems
that TCP has when operated over a connection comprising wireless links, and
propose solutions to those problems.

When TCP is run in a wireless environment, two major considerations must 
be made regarding its performance characterization: wireless links exhibit much 
poorer performance than their wireline counterparts and the effects of this be- 
havior are erroneously interpreted by TCP, which reacts to network congestion 
every time it detects packet loss, even though the loss itself may have occurred 
for other reasons. 

Furthermore, it has been shown that the statistical behavior of packet errors 
has a significant effect on the overall throughput performance of TCP, and that 
different higher-order statistical properties of the packet error process may lead
to vastly different performance even for the same average packet error rates.
It is therefore important to be able to accurately characterize the actual error 
process as arising in the specific environment under study, as simplistic error 
models may just not work. 

Another critical factor to be considered when wireless devices are used is the 
scarce amount of energy available, which leads to issues such as battery life and 
battery recharge, and which affects the capabilities of the terminal as well as 
its size, weight and cost. Since dramatic improvements on the front of battery 
technology do not seem likely, an approach which has gained popularity in the 
past few years consists in using the available energy in the best possible way, 
trying to avoid wasting power and to tune protocols and their parameters in 
such a way as to optimize the energy use.

It should be noted that this way of thinking may lead to completely different 
design objectives or to scenarios in which the performance metrics which have 
traditionally been used when evaluating communications schemes become less 
important than energy-related metrics. This energy-centric approach therefore 
gives a different spin to performance evaluation and protocol design, and calls 
for new results to shed some light on the energy performance of protocols. 

Studies on the energy efficiency of TCP have been very limited so far.
In addition, they do not specifically address the Wideband CDMA environment
typical of third generation systems, and therefore do not necessarily
provide the correct insight for our scenario.

The purpose of this paper is to provide a detailed study of the performance of 
TCP, in terms of both throughput and energy, when a Wideband CDMA radio 
interface is used. In particular, the parameters of the UMTS physical layer will 
be used in the study. As a first step of the study, we present results in the absence 
of link-layer retransmissions. This is done in order to more clearly understand 
the interactions between TCP’s energy behavior and the details of the radio 
interface. Results on the throughput performance of TCP in the presence of
link-layer retransmissions can be found, e.g., in [9, 10, 17], where, on the other
hand, energy performance is not studied.



2 System Model 

2.1 Simulation at the System Level 

In order to carry out the proposed research, a basic simulation tool has been 
developed. It simulates the operation of a multicellular wideband CDMA envi- 
ronment, where user signals are subject to propagation effects and transmitted 
powers are adjusted according to the power control algorithm detailed in the 
specifications. 

We consider a hexagonal cell layout with 25 cells total. This structure is 
wrapped onto itself to avoid border effects. Each simulation is a snapshot in 
time, so that no explicit mobility is considered while running the simulation. 

The fact that users may be mobile is taken into account in the specification
of the Doppler frequency, which characterizes the speed of variation of the
Rayleigh fading process. For the same reason, long-term propagation effects,
namely path loss and shadowing, are kept constant throughout each simulation.
Path loss relates the average received power to the distance between the
transmitter and the receiver, according to the general inverse-power propagation law
$P_r(r) = A r^{-\beta}$, where in our results we chose $\beta = 3.5$. Shadowing is modeled
by a multiplicative log-normal random variable with dB-spread $\sigma = 4$ dB, i.e.,
a random variable whose value expressed in dB has Gaussian distribution with
zero mean and standard deviation equal to 4 dB.

At the beginning of each simulation, all user locations are randomly drawn 
with uniform distribution within the service area, and the radio path gains to- 
wards all base stations are computed for each. Users are then assigned to the 
base station which provides the best received power. Such assignment does not 
change throughout the simulation. Also, note that each user is assigned to a 
single base station, i.e., soft handover is not explicitly considered here. Exten- 
sion of the program to include soft handover is currently in progress, although 
qualitatively similar results are expected. 
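
The user drop and base station assignment described above can be sketched as follows. This is only an illustration under stated assumptions: it uses a square 5 x 5 grid without wrap-around instead of the wrapped hexagonal layout, draws shadowing independently per link, and all function and variable names are hypothetical. The constants are the path loss and shadowing values quoted in the text and in Table 1.

```python
import math
import random

BETA = 3.5        # path loss exponent (from the text)
A_DB = -30.0      # path loss constant A in dB (Table 1)
SIGMA_DB = 4.0    # shadowing standard deviation in dB (from the text)


def path_gain_db(distance_m, rng, sigma_db=SIGMA_DB):
    """Long-term gain in dB: A * r^(-beta) times log-normal shadowing."""
    shadowing_db = rng.gauss(0.0, sigma_db)
    return A_DB - 10.0 * BETA * math.log10(max(distance_m, 1.0)) + shadowing_db


def assign_users(base_stations, n_users, area_m, rng):
    """Drop users uniformly in a square area and attach each to its best station."""
    assignment = []
    for _ in range(n_users):
        x, y = rng.uniform(0.0, area_m), rng.uniform(0.0, area_m)
        gains = [path_gain_db(math.hypot(x - bx, y - by), rng)
                 for bx, by in base_stations]
        best = max(range(len(base_stations)), key=lambda i: gains[i])
        assignment.append((best, gains[best]))
    return assignment


rng = random.Random(1)
# Hypothetical square grid of 25 stations; the spacing is an arbitrary choice.
stations = [(200.0 + 400.0 * i, 200.0 + 400.0 * j)
            for i in range(5) for j in range(5)]
users = assign_users(stations, n_users=90, area_m=2000.0, rng=rng)
```

Because shadowing is included in the comparison, a user is not always attached to the geometrically nearest station, which matches the "best received power" rule in the text.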

During a simulation run, the fading process for each user is dynamically
changed, according to the simulator proposed by Jakes and to the selected
value of the Doppler frequency. Note that the instantaneous value of the Rayleigh
fading does not affect the base station assignment. In order to take into account
the wideband character of the transmitted signals, a frequency-selective fading
model is used, with five rays whose relative strengths are taken from [12].
Maximal ratio combining through a RAKE receiver is assumed at the base station.
Only the uplink is considered in this paper, although similar results have been 
obtained for the downlink as well. 
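
For readers unfamiliar with sum-of-sinusoids fading generators, the sketch below produces a single-ray complex fading trace in the spirit of Jakes' simulator. It is not the exact generator used in this paper: it draws arrival angles and phases at random (a Clarke-type model), uses a single ray rather than five, and the function name and defaults are assumptions. The time step and the number of oscillators follow Table 1.

```python
import cmath
import math
import random


def rayleigh_fading_trace(doppler_hz, n_samples, dt_s=0.667e-3,
                          n_oscillators=8, seed=0):
    """Complex fading gains from a sum of sinusoids with random angles and phases."""
    rng = random.Random(seed)
    angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_oscillators)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_oscillators)]
    scale = 1.0 / math.sqrt(n_oscillators)
    trace = []
    for k in range(n_samples):
        t = k * dt_s
        # Each oscillator has Doppler shift fd * cos(angle) and its own phase.
        h = sum(cmath.exp(1j * (2.0 * math.pi * doppler_hz * math.cos(a) * t + p))
                for a, p in zip(angles, phases))
        trace.append(scale * h)
    return trace


slow = rayleigh_fading_trace(doppler_hz=6.0, n_samples=1500)
power_db = [10.0 * math.log10(max(abs(h) ** 2, 1e-12)) for h in slow]
```

With only a few oscillators the envelope is only approximately Rayleigh distributed; a larger Doppler frequency makes the trace vary faster, which is the channel memory effect discussed later in the paper.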

Each connection runs its own power control algorithm as detailed in the
specifications. The resulting transmitter power levels of all users, along with the
instantaneous propagation conditions, determine the received signal level at each
receiver, and therefore the level of interference suffered by each signal. We assume
here perfect knowledge of the Signal-to-Interference Ratio (SIR), which is used
to make the decision about whether the transmitted power should be increased
or decreased by the amount Δ. A finite dynamic range is assumed for the power
control algorithm, so that under no circumstances can the transmitted power be 
above a maximum or below a minimum value. The delay incurred in this update 
is assumed to be one time unit (given by the power control frequency of update), 
i.e., the transmitted power is updated according to the SIR resulting from the 
previous update. The effect of late updates on the overall performance has been 
studied in |2S|. 
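
The update rule described above can be sketched as a simple up/down loop. This is a minimal illustration, not the algorithm of the specifications: it treats the interference as fixed, so that the SIR in dB is the transmitted power plus a per-slot link term, and it assumes that the 80 dB power range of Table 1 extends downward from the maximum power. Function names and the starting power are hypothetical.

```python
def power_control_step(tx_power_dbw, measured_sir_db, sir_threshold_db,
                       step_db=0.5, max_power_dbw=-16.0, range_db=80.0):
    """One update: step up if the SIR is below the target, otherwise step down."""
    if measured_sir_db < sir_threshold_db:
        tx_power_dbw += step_db
    else:
        tx_power_dbw -= step_db
    min_power_dbw = max_power_dbw - range_db   # assumed lower end of the range
    return min(max(tx_power_dbw, min_power_dbw), max_power_dbw)


def run_power_control(link_term_db, sir_threshold_db, start_dbw=-60.0):
    """Follow a channel trace; each update uses the SIR of the previous time unit."""
    power = start_dbw
    powers = []
    for gain_db in link_term_db:
        powers.append(power)
        measured = power + gain_db             # SIR obtained with the current power
        power = power_control_step(power, measured, sir_threshold_db)
    return powers


powers = run_power_control([50.0] * 5, sir_threshold_db=5.5)
```

The clipping in `power_control_step` is what produces the saturation behavior discussed in Section 4.2, where some users cannot raise their power any further.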

Table 1 summarizes the various parameters used in the simulation.

Table 1. System-level simulation parameters

PARAMETER                         VALUE
Cell side                         200 m
β (path loss model)               3.5
A (path loss model)               -30 dB
Max. TX power                     -16 dBW
Power range                       80 dB
σ (shadowing)                     4 dB
Time unit                         0.667 ms
Number of oscillators (Jakes)     8
n_rays (selective channel)        5
Chip rate                         3.84 Mcps
Data rate                         240 kbps
SF (spreading factor)             16
Δ (power control step)            0.5 dB
Noise                             -132 dBW



2.2 Block Error Probability

The output of the system level simulations is a log of the values of the SIR,
transmitted power and fading attenuation for all users, which allows us to gain
some understanding of the time evolution of these quantities, as well as of the
behavior of the network at the system level. A post-processing package translates
the SIR traces into sequences of block error probabilities (BEP).

This is done while taking into account how the radio frames are deinterleaved 
and decoded. The interleaving schemes have been taken from the specifications, 
and a convolutional code with rate 1/2 and constraint length 8 with Viterbi 
decoding has been considered. An analytical approximation has been used to 
relate a string of SIR values (one per time unit) to the probability that the 
corresponding block (transmitted within one or more radio frames) is in error. 

The resulting trace of the block error probabilities can then be used in the 
simulation of higher-layer protocols, as is done in this paper, or to perform some 
statistical analysis of block errors. The latter approach has been explored elsewhere,
with the burstiness of the error process examined in some detail.



2.3 TCP Simulation 

For each simulation run (which corresponds to a given set of parameters), the 
SIR traces of all users are produced, which are then mapped into BEP traces 
as just explained. The latter are then used to randomly generate a block error 
sequence (BES). 

The BES is generated from the BEP traces by just flipping a coin with 
the appropriate probability in each Transmission Time Interval (TTI), which is
the time used to transmit a block. The BES obtained is then fed to the TCP 
simulator that uses it to specify the channel status in each TTI. By doing so, 
the SIR traces generated for all users in a simulation run are used by the TCP 
simulator to compute the throughput. 
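
The coin-flipping step can be written in a few lines. The sketch below is an illustration only; the function name, the seed handling and the example probabilities are assumptions rather than part of the original tool.

```python
import random


def block_error_sequence(bep_trace, seed=None):
    """Draw one Bernoulli outcome per TTI: True means the block is in error."""
    rng = random.Random(seed)
    return [rng.random() < p for p in bep_trace]


# Hypothetical BEP trace with one value per TTI.
bep = [0.0, 0.01, 0.5, 1.0, 0.02]
bes = block_error_sequence(bep, seed=42)
```

Since `rng.random()` returns a value in [0, 1), a block with probability 0 is never marked in error and a block with probability 1 always is.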

The average throughput is defined as the fraction of the channel transmission 
rate (considered at the IP level output) which provides correct information bits at
the receiver, i.e., not counting erroneous transmissions, spurious retransmissions,
idle time and overhead. 

In addition to the throughput value, the TCP simulator computes other 
parameters of interest, i.e.: the average block error probability, Pe (obtained for 
every user simply by averaging all the values of its BEP sequence); the average 
energy spent in transmitting data (obtained from the transmit power traces). 

These traces have been generated assuming continuous channel transmission 
and by updating the transmitted power according to the power control algorithm. 
However, when TCP is considered, transmission is bursty, due to the window 
adaptation mechanism implemented by the TCP algorithms, i.e., idle times occur 
when the window is full or the system is waiting for a timeout. 

To account for idle times, we have considered the actual transmitted power, 
equal to the one obtained from power traces when TCP transmits, and to zero 
otherwise. The average transmitted power is then computed by summing the 
actual transmitted power throughout the simulation (all slots) and dividing it 
by the total simulation time (in number of slots). Moreover, to obtain the av- 
erage consumed energy per correctly received bit, we simply divide the average 
transmit power by the correct information bit delivery rate. 
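
The averaging just described can be sketched as follows. This is a minimal illustration with hypothetical names and toy inputs: `transmitting` marks the slots in which TCP actually sends, and powers are given in watts rather than dBW so that they can be summed directly.

```python
import math


def tcp_energy_metrics(tx_power_w, transmitting, correct_bits, slot_s, cbr_bps):
    """Return (throughput S, average power in W, energy per correct bit in J)."""
    n_slots = len(tx_power_w)
    # Power counts only in slots where TCP transmits; idle slots cost zero.
    atp = sum(p for p, on in zip(tx_power_w, transmitting) if on) / n_slots
    cib = correct_bits / (n_slots * slot_s)   # correct information bits per second
    s = cib / cbr_bps
    ace = atp / cib if cib > 0 else math.inf
    return s, atp, ace


# Toy example: four slots of 0.25 s, transmitting in the first two only.
s, atp, ace = tcp_energy_metrics([0.02] * 4, [True, True, False, False],
                                 correct_bits=1000, slot_s=0.25, cbr_bps=2000.0)
```

When no bits are delivered correctly the energy per bit is reported as infinite, which reflects the sharp growth of the energy cost at low throughput noted later in the paper.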

In the following, we report the throughput and average consumed energy 
expressions: 



$$S = \frac{\mathrm{CIB}}{\mathrm{CBR}}, \qquad \mathrm{ACE} = \frac{\mathrm{ATP}}{\mathrm{CIB}} = \frac{\mathrm{ATP}}{S \cdot \mathrm{CBR}}$$

where S = average throughput, CIB = correct information bits per second, CBR =
channel bit rate at the IP level output, ACE = average consumed energy per bit,
and ATP = average transmitted power. We remark that, with our definitions,
ACE is the average consumed energy per correctly received bit, i.e., the energy
cost of delivering a single bit to the destination. Notice that this quantity is
equal to the inverse of the energy efficiency of the protocol.

The TCP simulator implements fragmentation of TCP segments, window
adaptation and error recovery. It simulates a simple unidirectional FTP session,
where the direct link packet generation is assumed continuous as in a long FTP
transfer. The TCP algorithm considered is New Reno. Data flow is unidirectional,
i.e., data packets flow only from sender to receiver, while ACKs flow
in the reverse direction. The receiver generates non-delayed ACKs, i.e., one ACK is
sent for each packet received. The TCP/IP stack is version 4, with a total of
40 bytes (including both TCP and IP overhead) for each header (compression is 
not taken into account in the results presented) and MTU size of 512 bytes. We 
have not considered RLC and MAC levels, which are assumed here to operate 
in transparent mode. RLC/MAC level implementation and characterization are 
currently under study. 

To compute the bit rate at the output of the IP level, we have to account for 
overhead added by the physical layer as well as possibly due to multiplexing of 
other channels. For the purpose of discussion, we assume the following figures:
a transport block is 1050 bits, including 16 bits of overhead due to transparent
RLC/MAC operation; a CRC and tail bits for code termination are added to 
this block, and the result is convolutionally encoded at rate 1/2. The resulting 
encoded block is then brought to 2400 encoded symbols by the rate matching 
algorithm, so that the raw physical layer symbol rate is 240 kbps. The application 
of a spreading factor SF = 16 makes it 3.84 Mcps, which is the standard channel 
transmission rate. At the IP level output, we then have a block of 1034 bits of 
data every 10 ms, thereby yielding a net bit rate of 103.4 kbps, which is the bit
rate used in the TCP simulator. 
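
The rate figures quoted above follow from simple arithmetic, reproduced here as a check. The constant names are illustrative; the numbers are those given in the text.

```python
FRAMES_PER_SECOND = 100            # one radio frame every 10 ms
TRANSPORT_BLOCK_BITS = 1050
RLC_MAC_OVERHEAD_BITS = 16
ENCODED_SYMBOLS_PER_FRAME = 2400   # after rate matching
SPREADING_FACTOR = 16

ip_bits_per_frame = TRANSPORT_BLOCK_BITS - RLC_MAC_OVERHEAD_BITS   # 1034 bits
net_rate_bps = ip_bits_per_frame * FRAMES_PER_SECOND               # 103.4 kbit/s
symbol_rate = ENCODED_SYMBOLS_PER_FRAME * FRAMES_PER_SECOND        # 240 ksymbol/s
chip_rate = symbol_rate * SPREADING_FACTOR                         # 3.84 Mchip/s
```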

The use of the TCP New Reno algorithm has been motivated by its implementation
of the fast recovery and fast retransmit algorithms, as recommended
especially for wireless environments. This is an optimization over previous
TCP versions, and is currently at the Proposed Standard level.

3 TCP Throughput Performance 

In the graphs presented we will indicate with Nu the number of users in the
simulation and with TTI the number of radio frames over which interleaving is
performed; SIRth = t + Δ dB indicates that the threshold used in the power
control algorithm is t, while Δ is its increment as described in Section 2.1. Finally,
with the term fd we refer to the Doppler frequency used in the Rayleigh fading
simulator. In the following graphs, the results will be represented as average
TCP throughput, S, vs. average block error probability, Pe, thereby assigning a
single point in the graph to each user.

3.1 Doppler Frequency Effect 

Figure 1 shows the TCP throughput performance for different values of the
Doppler frequency. The graph is plotted by reporting throughput vs. Pe for each 
of the 90 users involved in the simulation; each user is identified by a marker. 
The case of independent errors is also reported for comparison purpose (here the 
markers are used only to identify the curve and are not related to the users). 

Note that, for a given value of the Doppler frequency the points representing 
the various users of a simulation appear to lie along a fairly well-defined curve. 
It is worth stressing that this was not obvious a priori since different users are 
placed in different locations and are subject to different propagation conditions, 
both in terms of slow impairments (log-normal shadowing) and in terms of fast 
fading. 

This allows us to introduce the concept of “universal throughput curve” for a 
given situation, in the sense that users which suffer similar values of Pe will enjoy 
about the same throughput. Again, this is not obvious since different users in a 
simulation may see different statistical behaviors of the errors, which could in 
principle lead to different performance even in the presence of the same average 
error rate.

Fig. 1. TCP/IP sensitivity to Doppler (Nu = 90, TTI = 1, SIRth = 5.5 + 0.5 dB,
fd = 2, 6, 20, 40 Hz).

An explanation can be drawn from previously reported results, where it was found
that for a given value of fd there is a strong correlation between Pe and the
error burstiness. In this situation, Pe essentially determines the full second-order
characterization of the error process, which in turn almost fully specifies the value
of the TCP throughput. On the other hand, for different values of fd we observe
different curves. In fact, even in the presence of the same Pe the different extent
of the channel memory results in different performance. 
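
A simple way to see that the average error rate alone does not describe channel memory is to compare two block error sequences with the same Pe but different clustering. The sketch below uses the mean length of error runs as a crude burstiness proxy; this is an assumption made for illustration and is not the second-order characterization referred to in the text. The two sequences are toy examples.

```python
def error_statistics(bes):
    """Return (average error rate, mean length of runs of consecutive errors)."""
    bursts, run = [], 0
    for err in bes:
        if err:
            run += 1
        elif run:
            bursts.append(run)
            run = 0
    if run:
        bursts.append(run)
    pe = sum(bes) / len(bes)
    mean_burst = sum(bursts) / len(bursts) if bursts else 0.0
    return pe, mean_burst


scattered = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]   # isolated errors
bursty = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]      # same rate, one burst

print(error_statistics(scattered))
print(error_statistics(bursty))
```

Both sequences have the same average error rate, yet a window-based protocol reacts to three separate loss events in the first case and to a single one in the second.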

Another interesting observation from Figure 1 is that as the Doppler frequency
increases the performance degrades, i.e., slower channels correspond to
better performance, as already observed in earlier work. As expected, for sufficiently high
values of the Doppler frequency, the behavior of the system is close to the i.i.d.
case.

Finally, we note that the shape of the curves appears to be fairly regular, 
with a smooth transition from highest throughput values (essentially limited 
only by the percentage of overhead in the TCP packets, far left of the graph), to 
essentially zero throughput when errors are very likely (right end of the graph).
This shape, which has been observed by other authors, lends itself nicely to 
numerical fitting, as detailed later. 
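
As a rough check on the left end of the curves, the overhead bound can be worked out from the packet sizes given in Section 2.3. The calculation below is only a sketch: it assumes that the 512-byte MTU includes the 40-byte TCP/IP header and that no other source of inefficiency is present, and it writes S(0) for the throughput as Pe tends to zero.

```latex
S(0) \approx \frac{512 - 40}{512} = \frac{472}{512} \approx 0.92
```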

3.2 Interleaving Depth Effect 

In UMTS, besides the so-called intra-frame interleaving, which is always used 
to scramble the bits within a radio frame (10 ms) before encoding, it is also 
possible to use a second, interframe, interleaving, which mixes bits across frames. 
By doing so, the performance of the decoder is of course improved also in the 
presence of burst errors, but a larger interleaving delay is introduced. 

Therefore, another interesting sensitivity analysis regards the interleaving 
span allowed by the application. While keeping in mind the price to be paid in
terms of delay (an increase of the TTI corresponds to a larger delay), we can
note from Figure 2 how a deeper interleaving gives a beneficial effect. Notice also
that the shape of the curves is similar to what was observed in the previous case.

Fig. 2. Average throughput vs TTI (Nu = 90, TTI = 1, 2, 4, 8, SIRth = 5.5 + 0.5 dB,
fd = 6 Hz).

3.3 Network Load Effect 

The effect of the network load is shown in Figures 3 and 4. In Figure 3 results
from three simulations are shown, with 80, 90 and 100 users in the network,
respectively. For each simulation, all users are assigned a point with the same
abscissa (which is given by the number of users in the system) and with vertical
coordinate given by their average TCP throughput.

We can see how for increasing load the presence of disadvantaged users becomes
more noticeable, as expected, and the system is more and more unfair.
The same results are represented in Figure 4 by reporting the throughput as
a function of Pe, where the curve along which the various points are aligned is
relatively insensitive to the system load.

In Figure 1 it was observed that, in a given scenario, two users whose average
block error probability is the same will have essentially the same throughput,
regardless of the specific situation of each. What we see here is that this behavior
still holds across simulations in which different levels of network load are
considered (but for the same Doppler frequency). For higher network load, each
user will certainly see worse performance due to the increased interference, but 
the relationship between average throughput and average error rate is essentially 
unaffected. 

This highlights the power of the concept of “universal curve” which can be 
used to study typical cases and to infer TCP throughput performance based
only on easily measurable physical layer parameters. In Figure 4 a possible form
for the universal curve is given for comparison with the simulator output points;
more details about this fitting function will be presented below. Similar behavior
has been observed for other values of the Doppler frequency.

Fig. 3. TCP/IP average throughput vs network load (Nu = 80, 90, 100, TTI = 1,
SIRth = 5.5 + 0.5 dB, fd = 6 Hz).



3.4 Analytical Throughput Prediction 



The observed shape of the throughput curves, which tend to a constant equal
to one minus the percentage of overhead for $P_e \to 0$ and to zero for $P_e \to 1$,
suggests a numerical fit involving the logarithm of Pe. Also, the shape of the
transition is seen to depend on the value of the Doppler frequency.

In Figure 5 we show a proposal for the modeling of the TCP throughput
behavior through a heuristic function f(x), independent of the network load
and parameterized only by the Doppler frequency, as suggested by the curves
of Figure 4 as well as by other results not shown in this paper. The proposed
expression for f(x) is as follows:



fix) 



S{0)- 






with X = 1+ Q, = x_3^ Xs = Afl+B ~ ^ where A = 1.39, B = 2.78, 

k = 0.03, S(0) is the average throughput for Pe = 0, and fd is the Doppler
frequency in Hz.

The accuracy of the proposed fit has been tested for various values of the
parameters involved. Examples of these tests are given in Figures 4 and 6, in which
the fitting expression is compared against the simulation results for two values
of the Doppler frequency. These graphs show that the analytical expression is
reasonably close to the actual points obtained by simulation.

Fig. 4. Throughput curve, independence from network load with Doppler 6 Hz (Nu =
80, 90, 100, TTI = 1, SIRth = 5.5 + 0.5 dB, fd = 6 Hz).

Fig. 5. Throughput heuristic function.



4 TCP Energy Performance 

All previous results were obtained for a given value of the power control
threshold, SIRth, which is used to drive the transmit power dynamics at each user and
which directly affects the error performance. In fact, choosing a higher value of
this threshold has the double effect of forcing the users to transmit more power
in order to achieve a higher SIR (thereby consuming more energy) but also of
causing the SIR experienced by the typical user to be higher (thereby improving
the error rates and therefore the TCP throughput).

Fig. 6. Throughput heuristic function: approximation of simulator outputs (Nu = 90,
TTI = 1, SIRth = 5.5 + 0.5 dB, fd = 6, 40 Hz).

It is therefore of interest to study how varying the power control SIR threshold
makes it possible to strike a tradeoff between QoS and energy consumption.



4.1 Throughput and Consumed Energy 

Figures 7 and 8 show the trade-off between TCP throughput and consumed
energy. Each curve corresponds to a given user for a given set of parameters,
whereas different points on the same curve refer to different values of the power
control threshold. Figure 7 shows the effect of using different threshold values
in terms of S as a function of ATP. In this figure only the behavior of some
selected users has been reported. These users can be seen as representative of all 
users in the network, in the sense that they illustrate typical behaviors as they 
arise in the system. 

In Figure 8 the same results are shown, by considering ACE instead of ATP;
as expected, for low throughput the obtained curves are shifted to the right,
since ACE is obtained by dividing ATP by the throughput S (except for an
inessential constant scaling factor). This is intuitively explained by the fact that,
for a given average consumed power, the energy per correct information bit is
greater for low throughput values, i.e., when it is hard to deliver bits correctly,
the cost associated to each one of them is higher. Notice that TCP already
does the right thing by stopping transmission when the channel is very bad (the
timeout event), whereas it tries to recover from errors whenever possible through
retransmissions, which may waste some power.

Fig. 7. Throughput vs average transmitted power for different threshold values (Nu =
120, TTI = 1, SIRth = {1.5, 2.5, 3.5, 4, 4.5, 5} + 0.5 dB, fd = 6 Hz).

Fig. 8. Throughput vs average consumed energy for different threshold values (Nu =
120, TTI = 1, SIRth = {1.5, 2.5, 3.5, 4, 4.5, 5} + 0.5 dB, fd = 6 Hz).

In general, increasing the threshold should lead to better performance for 
many users, since the SIR experienced is expected to be higher; this is not 
necessarily true, however, since, in order to achieve a higher SIR threshold, 
many users will transmit more power, thereby causing more interference in the 
system. If the SIR objective is not achievable for all users in the system, some
users will actually see degraded performance for higher values of the threshold
since, although the threshold value would correspond to better performance, 
they cannot achieve its value. 

From the obtained results we have noted different user behaviors. For some
users an increasing power threshold always corresponds to a greater throughput: 
for these users the throughput as a function of the power threshold is a monotonic 
curve. For others, the throughput vs. consumed power curve increases up to 
a breakpoint, after which an increment of the power threshold (and thereby 
of the transmitted power) actually leads to worse performance, due to greater 
interference as discussed above. 

In our results, user 14 is the one showing monotonic behavior, since it experi- 
ences favorable propagation conditions, and therefore is not significantly affected 
by the increased interference level in the system. For users 11, 33 and 114, a dif- 
ferent situation can be observed. In particular, user 11 is the one with the worst 
behavior as the power threshold increases. 

In any event, from these results we can conclude that increasing the target 
value of the SIR in the system does not necessarily translate into improved 
quality, but there exists an optimal value of the threshold, beyond which some 
users will experience negligible throughput improvements, whereas others will 
even see degraded performance. In the cases studied in this paper, this optimal 
value is seen to be close to 3.5 dB. 
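
The distinction between monotonic users and users with a breakpoint can be expressed as a small check on a per-user throughput curve. The sketch below is illustrative only: the function name is hypothetical, and the two throughput lists are invented shapes chosen to mimic the two behaviors described above, not results from the simulations.

```python
def throughput_breakpoint(thresholds_db, throughput):
    """Return (threshold with maximum throughput, whether the curve never decreases)."""
    best = max(range(len(throughput)), key=lambda i: throughput[i])
    monotonic = all(b >= a for a, b in zip(throughput, throughput[1:]))
    return thresholds_db[best], monotonic


thresholds = [1.5, 2.5, 3.5, 4.0, 4.5, 5.0]   # threshold values used in Figures 7-12
# Illustrative shapes only; these are not results from the paper.
favourable_user = [0.60, 0.75, 0.85, 0.87, 0.88, 0.89]
disadvantaged_user = [0.40, 0.60, 0.70, 0.65, 0.50, 0.30]

print(throughput_breakpoint(thresholds, favourable_user))
print(throughput_breakpoint(thresholds, disadvantaged_user))
```

A system-wide threshold could then be chosen by applying such a check to every user and looking for the value beyond which few users still gain, which is the reasoning behind the optimum reported above.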

4.2 Energy and Threshold 

Another important remark regards the numerical values shown in Figures 7 and
8. It can be clearly seen that unlike for throughput, which except for badly chosen
values of the threshold exhibits relatively small variations, the range spanned 
by the energy performance extends over multiple orders of magnitude. This 
indicates that the choice of the proper power control threshold, while certainly 
important for error and throughput considerations, becomes critical when energy 
performance is considered. 

Figure 9 shows the average consumed power as a function of the power threshold.
As expected, a greater threshold value always corresponds to a higher consumed
power, i.e., all curves have a strictly monotonic behavior. This is due to
the fact that for higher threshold values, more power is necessary in order to 
obtain the required SIR target. 

A different behavior is observed in Figure 10, in which the consumed energy
per bit is reported instead. In particular, notice that at the far left of the
graph, a decrease of the threshold, although corresponding to smaller average
power (see Figure 9), results in a higher energy cost per bit. This behavior
corresponds to users suffering from low throughput performance, where the
consumed energy per correct information bit grows, as explained before. As a
last observation, we note from Figure 9 that users 11, 33 and 114, i.e., the
ones that suffer from system interference as the threshold grows, show a
transmitted power which is essentially constant for values of the threshold
beyond 4 dB. This is due to the fact that these users, in trying to achieve the
required SIR and to make up for the increased interference, have reached the
maximum allowed value for the transmit power, and therefore their power cannot
be increased any more.

Fig. 9. Average consumed power vs power threshold (Nu = 120, TTI = 1, SIRth =
{1.5, 2.5, 3.5, 4, 4.5, 5} + 0.5 dB, fd = 6 Hz).

Fig. 10. Average consumed energy vs power threshold (Nu = 120, TTI = 1, SIRth =
{1.5, 2.5, 3.5, 4, 4.5, 5} + 0.5 dB, fd = 6 Hz).
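
The reason why a lower transmit power can still cost more energy per bit may be
illustrated with a small numerical sketch. It assumes that the energy per
correct bit is the average consumed power divided by the achieved throughput;
the function name and all numbers below are hypothetical and are not taken from
the simulation results.

```python
def energy_per_bit(avg_power_w: float, throughput_bps: float) -> float:
    """Energy spent per correctly delivered bit, in joules (assumed definition)."""
    if throughput_bps <= 0:
        raise ValueError("throughput must be positive")
    return avg_power_w / throughput_bps


# Illustrative values only (not simulation results): a very low threshold
# saves power but collapses throughput, so each delivered bit costs more.
low_threshold = energy_per_bit(avg_power_w=0.01, throughput_bps=500.0)
good_threshold = energy_per_bit(avg_power_w=0.05, throughput_bps=50_000.0)

print(f"low threshold:  {low_threshold:.2e} J/bit")
print(f"good threshold: {good_threshold:.2e} J/bit")
```

With these made-up inputs the low-threshold user spends five times less power
but twenty times more energy per delivered bit.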



4.3 Error Probability and Consumed Energy 

A similar QoS-energy trade-off relates to the error rate performance instead
of the TCP throughput. In Figure 11 the block error probability Pe is reported
against the average transmitted power (ATP) for different values of the
threshold. From the graph, we note that Pe decreases as the threshold grows
until a minimum is reached. After this point Pe starts to grow, again due to
the increased interference in the system. The points at which Pe has a minimum
are the same at which the throughput of the system is maximized (see Figure I3).

As before, user 14 is the only user considered for which Pe never grows, i.e.,
increasing the threshold always leads to better performance. In Figure 12 the
average consumed energy (ACE) is reported instead of ATP and, as in the
previous cases, some points are shifted to the right due to poor throughput
performance.

Fig. 11. Pe vs average transmitted power for different threshold values (Nu =
120, TTI = 1, SIRth = {1.5, 2.5, 3.5, 4, 4.5, 5} + 0.5 dB, fd = 6 Hz).

Fig. 12. Pe vs average consumed energy for different threshold values (Nu =
120, TTI = 1, SIRth = {1.5, 2.5, 3.5, 4, 4.5, 5} + 0.5 dB, fd = 6 Hz).

From the above results, we may conclude that the selection of the power
control SIR threshold is critical in striking the right trade-off between QoS
and energy performance. In particular, we observe that the potential for energy
gains is very significant compared to similar effects on throughput, being
measured over multiple orders of magnitude. Therefore, it seems that more
attention should be given to energy consumption issues at the Radio Resource
Management level, which is responsible for the power control parameter
selection.

5 Conclusions 

In this paper, some results on the behavior of TCP over a Wideband CDMA air 
interface have been reported. In particular, TCP throughput curves have been 
obtained, and their dependence on various parameters, such as the number of 
users, the interleaving depth, and the Doppler frequency, has been investigated. 

From this study, we have found that the relationship between average TCP
throughput and average block error rate is largely independent of the number of
users in the system. For this reason, it is possible to empirically characterize
such a curve with a matching function that only depends on the Doppler
frequency. A study of the energy consumption has also been performed, showing
that a trade-off between the throughput and the power control threshold exists.
It is therefore possible to trade off QoS of the data transfer for increased
energy efficiency. For many users, it has been shown that an optimal value of
the threshold exists, potentially leading to very significant energy savings in
return for very small throughput degradation. In the considered case, values of
the power control threshold close to 3.5-4 dB strike the best trade-off.

In order to focus exclusively on the interactions of TCP with the WCDMA
radio technology and as a preliminary step in characterizing TCP's energy
performance in this scenario, no link-layer retransmissions have been
considered in this study. Future work includes extension of the study to the
presence of a radio link layer which improves the wireless link performance
through block retransmission. Similar performance studies for TCP versions
other than New-Reno also seem worth pursuing.

Abstract. The General Packet Radio Service extends the existing GSM mobile
communications technology by providing packet switching and higher data 
rates in order to efficiently access IP-based services in the Internet. Even though 
Quality-of-Service notions and parameters are included in the GPRS 
specification, no realization path for Quality-of-Service support has been 
proposed. In this paper we adapt the Differentiated Services framework and 
apply it over the GPRS air interface in order to provide various levels of service 
differentiation and a true end-to-end application of the Internet Differentiated 
Services architecture. 



1 Introduction 

The convergence of mobile technologies with the technologies of the Internet has
been of great importance over the last decade. One step in this direction was the
introduction of the General Packet Radio Service (GPRS) over the Global System for
Mobile communications (GSM). GPRS is a packet-switched service offered as an
extension of GSM. In contrast to the classic circuit-switched service provided by
GSM, GPRS offers the efficiency of packet switching desirable for bursty traffic,
higher transfer speeds than the ones available today to a single end-terminal
(theoretically up to 115 kbps) and instantaneous connectivity with any IP-based
external packet network.

An important issue in this context is the Quality-of-Service (QoS) provided by 
GPRS. Even though GPRS specifications define QoS parameters and profiles, we are 
unaware of specific implementation plans and strategies in order to support specific 
QoS models, particularly over the wireless access network. Recent proposals in the 
area of GPRS QoS focus on providing QoS support in the core GPRS network (which 
is typically non-wireless and IP based) [11] using the standard Internet QoS 
frameworks (i.e., Integrated Services or Differentiated Services). 

On the other hand, we believe that the critical part for the support of QoS to the 
applications and the end users is the access network where, because of the scarcity of 
the radio spectrum, greater congestion problems can result. Therefore, we have 
developed an architecture that provides QoS in the form of support for Differentiated 
Services over the radio link and integration with the Internet DiffServ architecture, 
thus providing end-to-end QoS “guarantees” [12]. As described later in this paper, 
GPRS operators can easily implement this proposal, with no need for radical changes
to their existing GPRS network architecture. 

The structure of the remainder of this paper is as follows. First we provide a short
overview of the GPRS technology and architecture. We then briefly review the
Internet Differentiated Services architecture, focusing particularly on the
description of the two-bit DiffServ scheme. In the following section we adapt the
two-bit DiffServ scheme to the GPRS environment, describing all the new tasks that
must be performed by the GPRS Support Nodes (GSNs), the key new elements
in the GSM architecture introduced to support GPRS. Finally, we discuss some open
issues and present our conclusions.



2 The GPRS Environment 

GPRS [2] is a new service offered by the GSM network. In order for operators to
be able to offer such services, two new types of nodes must be added to the existing
GSM architecture: the serving GPRS support node (SGSN) and the gateway GPRS
support node (GGSN), as shown in Fig. 1.




The SGSN keeps track of the location of mobile users, along with other
information concerning the subscriber and its mobile equipment. This information is
used to accomplish the tasks of the SGSN, such as packet routing and switching,
session management, logical link management, mobility management, ciphering,
authentication and charging functions. The GGSN, on the other hand, connects the
GPRS core network to one or more external Packet Data Networks (PDNs). One of
its tasks is to convert incoming packets to the appropriate protocol in order to
forward them to the PDN. The GGSN is also responsible for GPRS session
management and for the correct assignment of an SGSN to a Mobile Station (MS),
depending on the MS's location. Finally, the GGSN contributes to the gathering of
useful information for the GPRS charging subsystem.

The core GPRS network is IP based. Among the various GSNs (SGSN and GGSN)
the GPRS Tunnelling Protocol (GTP) is used. GTP, which is itself based on IP,
constructs tunnels between two GSNs that want to communicate [1]. At the radio
link, the existing GSM structure is used, making it easier for operators to offer
GPRS services. The uplink and downlink bands are divided through FDMA into 124
frequency carriers each. Each frequency is further divided through TDMA into eight
timeslots, which form a TDMA frame. Each timeslot lasts 0.5769 ms and is able to
transfer 156.25 bits (both data and control). The recurrence of one particular
timeslot defines a Packet Data CHannel (PDCH). Depending on the type of data
transferred, a variety of logical channels are defined, which carry either data
traffic or traffic for channel control, transmission control or other signaling
purposes.
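
The timing figures quoted above can be cross-checked with a few lines of
arithmetic. The sketch below uses only the numbers given in the text; the
variable names are ours, and the resulting rates are gross values that include
control bits, not user throughput.

```python
TIMESLOT_MS = 0.5769        # duration of one timeslot (from the text)
BITS_PER_TIMESLOT = 156.25  # data and control bits per timeslot (from the text)
SLOTS_PER_FRAME = 8         # timeslots per TDMA frame
CARRIERS_PER_BAND = 124     # FDMA carriers in each of the uplink/downlink bands

# Duration of one TDMA frame.
frame_ms = SLOTS_PER_FRAME * TIMESLOT_MS

# Gross bit rate of one carrier: bits per millisecond equals kbit/s.
carrier_gross_kbps = BITS_PER_TIMESLOT / TIMESLOT_MS

# One recurring timeslot gets one eighth of the carrier.
pdch_gross_kbps = carrier_gross_kbps / SLOTS_PER_FRAME

# Number of (carrier, timeslot) pairs available per direction.
physical_channels = CARRIERS_PER_BAND * SLOTS_PER_FRAME

print(f"TDMA frame: {frame_ms:.4f} ms")
print(f"carrier gross rate: {carrier_gross_kbps:.2f} kbit/s")
print(f"per-timeslot gross rate: {pdch_gross_kbps:.2f} kbit/s")
print(f"physical channels per direction: {physical_channels}")
```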

The major difference between GPRS and GSM concerning the radio interface is 
the way radio resources are allocated. In GSM, when a call is established, a channel is 
permanently allocated for the entire period. In other words, one timeslot is reserved 
for the whole duration of the call, even when there is no activity on the channel. This 
results in a significant waste of radio resources in the case of bursty traffic. In GPRS 
the radio channels, i.e. the timeslots, are allocated on demand. This means that
when an MS is not using a timeslot that has been allocated to it in the past, this
timeslot can be re-allocated to another MS. The minimum allocation unit is a radio block, i.e.
four timeslots in four consecutive TDMA frames. One RLC/MAC packet can be 
transferred in a radio block. 
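
As a rough illustration of what the radio block granularity implies, the
following sketch estimates how long a burst of RLC/MAC packets occupies the
radio link. It is a simplification under stated assumptions: idle frames,
signalling and retransmissions are ignored, and the function name
`transfer_time_ms` is hypothetical.

```python
import math

TIMESLOT_MS = 0.5769          # from the text
SLOTS_PER_FRAME = 8           # from the text
FRAMES_PER_RADIO_BLOCK = 4    # one timeslot in four consecutive TDMA frames

FRAME_MS = SLOTS_PER_FRAME * TIMESLOT_MS
RADIO_BLOCK_MS = FRAMES_PER_RADIO_BLOCK * FRAME_MS


def transfer_time_ms(n_packets: int, n_timeslots: int = 1) -> float:
    """Time to send n_packets RLC/MAC packets, one per radio block.

    n_timeslots is the number of timeslots per TDMA frame assigned to the MS,
    so that many radio blocks are carried in parallel (assumption: ideal
    multi-slot operation without any overhead).
    """
    if n_packets < 0 or n_timeslots < 1:
        raise ValueError("need n_packets >= 0 and n_timeslots >= 1")
    rounds = math.ceil(n_packets / n_timeslots)
    return rounds * RADIO_BLOCK_MS


print(f"radio block: {RADIO_BLOCK_MS:.4f} ms")
print(f"10 packets, 1 timeslot:  {transfer_time_ms(10, 1):.2f} ms")
print(f"10 packets, 4 timeslots: {transfer_time_ms(10, 4):.2f} ms")
```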




One or more timeslots per TDMA frame (multi-slot capability) may be assigned to
an MS for the transfer of its data. During the transfer, the Base Station Subsystem
(BSS) may decrease (or in some cases increase) the number of timeslots assigned to
that particular MS, depending on the current demand for timeslots. This is
accomplished by the use of flags (Uplink State Flag, USF) and counters (Countdown
Value) in the headers of the packets transferred on the radio link.

In order to exchange data with external networks, a session must be
established between the MS and the appropriate GGSN. This session is called a
Packet Data Protocol (PDP) context [4]. During the activation of such a context, an
address (compatible with the external network, i.e. IP or X.25) is assigned to the
MS and is mapped to its IMSI, and a path from the MS to the GGSN is built. The MS
is now visible from the external network and is ready to send or receive packets.
The PDP context concerns the end-to-end path in the GPRS environment (MS to GGSN).

Fig. 3. PDP context and TBF

At the (lower) radio link level, when the MS starts receiving/sending data, a 
Temporary Block Flow (TBF) is created [5]. During this flow a MS can receive and 
send radio blocks uninterrupted. For a TBF establishment, the MS requests radio 
resources and the network replies indicating the timeslots available to the MS for data 
transfer. A TBF may be terminated even if the session has not ended yet. The 
termination of a TBF depends on the demand for radio resources and the congestion 
of the link. After the termination, the MS must re-establish a new TBF to continue its 
data transfer. 

ETSI has also specified a set of QoS parameters and the corresponding profiles that
a user can choose. These parameters are precedence, reliability, delay, and peak and
mean throughput [3]. Precedence (priority) defines three classes (high, medium and
low). Three classes are also defined for reliability, four for delay, nine for
peak throughput and thirty-one for mean throughput (including best-effort). A
user's profile may require that the level of all (or some) parameters is
defined. This profile is stored in the HLR. Upon activation of a PDP context, the
mobile station is responsible for the required uplink traffic shaping, while on the
downlink the GGSN is responsible for traffic shaping. Clearly, such an
implementation does not guarantee that a user will conform to the agreed profile.
Also, the QoS profiles are not taken into consideration by the resource allocation
procedures. Thus, it is up to the GPRS operator to use techniques that provide QoS
“guarantees” and to police user traffic.
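
To make the structure of such a profile concrete, here is a minimal sketch of
a QoS profile record that enforces the class counts listed above. The class
name `QoSProfile`, the field names and the numbering of classes from 1 upwards
are assumptions made for illustration; they are not taken from the GPRS
specification.

```python
import math
from dataclasses import dataclass

# Number of classes per parameter, as listed in the text.
CLASS_COUNTS = {
    "precedence": 3,
    "reliability": 3,
    "delay": 4,
    "peak_throughput": 9,
    "mean_throughput": 31,
}


@dataclass(frozen=True)
class QoSProfile:
    """Hypothetical container for one user's QoS profile."""
    precedence: int
    reliability: int
    delay: int
    peak_throughput: int
    mean_throughput: int

    def __post_init__(self):
        # Reject class indices outside 1..count (numbering is an assumption).
        for name, count in CLASS_COUNTS.items():
            value = getattr(self, name)
            if not 1 <= value <= count:
                raise ValueError(f"{name} must be in 1..{count}, got {value}")


# Number of distinct fully specified profiles implied by the class counts.
n_profiles = math.prod(CLASS_COUNTS.values())

profile = QoSProfile(precedence=1, reliability=2, delay=4,
                     peak_throughput=9, mean_throughput=31)
print(profile, n_profiles)
```

The large number of possible combinations is one motivation for restricting
attention to a single parameter, as done in the following paragraph.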

A first step in this direction is to use only the precedence parameter to define QoS
classes and link allocation techniques. Precedence was chosen because of its
simplicity and effectiveness and because it can be directly implemented in the GPRS
architecture, as we will see in the following sections. Also, precedence maps
naturally onto the idea of Differentiated Services, which is the preferred
(realistic) approach for QoS in the Internet and is gaining wide acceptance.



3 Differentiated Services 

The Internet is experiencing increased publicity lately and great success. Multimedia 
and business applications have increased the volume of data traveling across the 
Internet, causing congestion and degradation of service quality. An important issue of 
practical and theoretical value is the efficient provision of appropriate QoS support. 

Integrated Services [6], [7] was proposed as a first solution to the problem of
ensuring QoS guarantees to a specific flow across a network domain, by reserving the
needed resources at all the nodes through which the flow passes. This is
achieved through the Resource Reservation Protocol (RSVP) [6], which provides the
necessary signaling in order to reserve network resources at each node. Although the
Integrated Services solution works well in small networks, attempts to expand it to
wider (inter-)networks, such as the Internet, have revealed many scalability
problems.

An alternative architecture, Differentiated Services [7], was designed to address
these scalability problems by providing QoS support on aggregate flows. In a domain
where Differentiated Services are used, i.e. a DS domain, the user keeps a contract
with the service provider. This contract, the Service Level Agreement (SLA),
characterizes the user's flow passing through this DS domain, so as to include it in
an aggregate of flows. The SLA also defines the behavior of the domain's nodes
towards the specific type of flow, i.e. the Per-Hop Behavior (PHB). SLAs are also
arranged between adjacent DS domains, so as to specify how flows directed from one
domain to another will be treated.




Fig. 4. The Differentiated Services Architecture (legend: First Hop Router,
Internal Router, Border Router)

The DS field in an IP packet defines the PHB that each packet of a particular flow
type shall have. This field uses reserved bits in the IP header: the “Type Of
Service” field in IPv4 and the “Traffic Class” field in IPv6. In Fig. 4 we depict
the DS architecture. The first-hop router is the only DS node that handles
individual flows. Its task is to check whether a flow originating from a user
conforms to the contract that this user has signed and to shape it if it is found to
be out of bounds. This is achieved by using traffic conditioners. The internal
routers handle aggregates of flows and treat them according to the PHB that
characterizes them. The border router checks whether the incoming (or outgoing)
flows conform to the contract that has been agreed to between the neighbor DS
domains. All the traffic that exceeds the conditions of the contract is (typically)
discarded.
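
As an illustration of where the DS field sits, the sketch below extracts it
from the two header fields named above. It relies on the layout defined in
RFC 2474, which is not spelled out in this paper: the codepoint occupies the
six most significant bits of the IPv4 “Type Of Service” byte or of the IPv6
“Traffic Class” byte. The function names are ours.

```python
def dscp_from_ipv4_tos(tos_byte: int) -> int:
    """Return the 6-bit DS codepoint carried in an IPv4 Type Of Service byte."""
    if not 0 <= tos_byte <= 0xFF:
        raise ValueError("TOS must fit in one byte")
    return tos_byte >> 2  # drop the two least significant bits


def traffic_class_from_ipv6_word(first_word: int) -> int:
    """Return the Traffic Class byte from the first 32 bits of an IPv6 header.

    Layout of the word: version (4 bits) | traffic class (8) | flow label (20).
    """
    if not 0 <= first_word <= 0xFFFFFFFF:
        raise ValueError("expected a 32-bit word")
    return (first_word >> 20) & 0xFF


tos = 0xB8                 # example value chosen for illustration
ipv6_word = 0x6B800000     # version 6, traffic class 0xB8, flow label 0
print(dscp_from_ipv4_tos(tos))
print(dscp_from_ipv4_tos(traffic_class_from_ipv6_word(ipv6_word)))
```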

Currently, there are no standardized PHBs, but two of the basic PHBs are widely
accepted. These are the Premium (or Expedited) Service [9] and the Assured Service
[8]. In Premium Service, the key idea is that the user negotiates with the ISP a
minimum bandwidth that will be available to the user no matter what the load of the
link will be. Also, the ISP sets a maximum bandwidth allowed for this type of flow,
so as to prevent the starvation of other flows. In most cases these two limits are
equal, making Premium Service act like a virtual leased line or, better, like the
CBR service of ATM. The exceeding packets are discarded while the remaining ones are
forwarded to the next node.

The Assured Service does not provide any strict guarantees to the users. It defines
four independent classes. Within each class, packets are tagged with one of three
different levels of drop precedence. So, whether a packet will be forwarded or not
depends on the resources assigned to the class it belongs to, the congestion level
of that class and the drop precedence with which it is tagged. In other words,
Assured Service provides a high probability that the ISP will transfer the
high-priority-tagged packets reliably. Exceeding packets are not discarded, but they
are transmitted with a lower priority (higher drop precedence).

It has been realized that there are many benefits from the deployment of both 
Premium and Assured services in a single DS domain. Premium service is thought of 
as a conservative assignment, while Assured service gives a user the opportunity to 
transmit additional traffic without penalty. Nowadays, Differentiated Services are 
known as the combination of these two services. This new architecture uses a two-bit 
field to distinguish the various types of services and is called Two-bit Differentiated 
Service [10]. 

Each packet is tagged with the appropriate bit (A-bit or P-bit, with null for
best-effort). The ISP has previously defined the constant rate that Premium Service
should guarantee. Also, exceeding packets that belong to a Premium flow are dropped
or delayed, while exceeding packets of Assured Service are forwarded as best effort.
In Fig. 5 we depict the tasks accomplished by the first hop router of the two-bit
DiffServ architecture.




Fig. 5. First Hop Router (legend: Assured Service, Premium Service, Best-effort;
Assured and best-effort packets share a RIO queue)

In the first hop router, packets that are tagged by users are checked for their
conformity with the agreed SLA. In the case of Premium Service, all packets tagged
with the P-bit wait in the first queue until there is a token available in the token
pool. When a token becomes available, the packets are forwarded to the output queue.
In the case of Assured Service, the packets for which there is no token available
are forwarded to the output queue as best-effort packets, with a null tag. The queue
that is used by both Assured Service and best-effort packets is a RIO (RED with In
and Out) queue. RIO queues are RED (Random Early Detection) type queues with two
thresholds instead of one, one for in-profile packets and one for out-of-profile
packets. In this case, in-profile are the packets marked with the A-bit, while the
rest (best-effort packets) are assumed to be out-of-profile. The threshold for
in-profile packets is higher than the threshold for the out-of-profile packets, so
that the latter are discarded more often than the former. With this technique, a
“better than best-effort” service is given to the packets using Assured Service.
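
The behavior just described can be sketched as follows. This is a minimal
illustration, not the router implementation of [10]: token pools are plain
counters that are refilled by an external caller, the class and function names
are hypothetical, and the RIO decision is reduced to a deterministic
comparison with made-up thresholds, whereas real RED-type queues drop packets
probabilistically.

```python
from collections import deque

PREMIUM, ASSURED, BEST_EFFORT = "P", "A", None  # P-bit, A-bit, null tag


class FirstHopConditioner:
    """Per-user conditioner: two token pools and one Premium waiting queue."""

    def __init__(self, premium_tokens: int, assured_tokens: int):
        self.premium_tokens = premium_tokens
        self.assured_tokens = assured_tokens
        self.premium_wait = deque()  # Premium packets waiting for a token
        self.output = []             # (packet, tag) pairs sent to the output queues

    def arrive(self, packet, tag):
        if tag == PREMIUM:
            # Premium packets queue up and leave only when a token exists.
            self.premium_wait.append(packet)
            self._drain_premium()
        elif tag == ASSURED and self.assured_tokens > 0:
            self.assured_tokens -= 1
            self.output.append((packet, ASSURED))
        else:
            # Best-effort traffic, or Assured traffic without a token (demoted).
            self.output.append((packet, BEST_EFFORT))

    def add_tokens(self, premium: int = 0, assured: int = 0):
        """Called periodically to model the constant fill rate of the pools."""
        self.premium_tokens += premium
        self.assured_tokens += assured
        self._drain_premium()

    def _drain_premium(self):
        while self.premium_wait and self.premium_tokens > 0:
            self.premium_tokens -= 1
            self.output.append((self.premium_wait.popleft(), PREMIUM))


def rio_should_drop(avg_queue_len: float, in_profile: bool,
                    in_threshold: float = 40, out_threshold: float = 20) -> bool:
    """Simplified RIO test: out-of-profile packets hit a lower threshold."""
    threshold = in_threshold if in_profile else out_threshold
    return avg_queue_len > threshold
```

For example, with one token in each pool, a second Premium packet waits until
`add_tokens(premium=1)` is called, while a second Assured packet is forwarded
immediately with a null tag.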

Note that in the above figure, only the architecture concerning flows from one user 
is depicted. This is because the first hop router is the first, and only, router that 
controls and shapes individual flows. Therefore, we can assume that for each user 
there are two pools of tokens and a queue. The output queues are the same for all 
users and their characteristics depend on the outbound transfer rate of the router. The 
output queues can be served either by a simple priority scheme or by a more complex 
algorithm, such as the Weighted Fair Queuing (WFQ) algorithm. 

At the border router the same basic tasks are performed, with a small variation. 
Since the border router manages and controls flow aggregates, it cannot buffer the 
packets that exceed the agreements. Thus, the packets tagged with the P-bit are not 
queued, as in the first hop router, but they are discarded, as shown in Fig. 6. 
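
The difference from the first hop router can be sketched in a few lines. The
function below is a hypothetical illustration under the same simplifying
assumptions as before (token pools as plain counters); it assumes that Assured
traffic is handled exactly as in the first hop router, which is how we read
the phrase “the same basic tasks”.

```python
def police_aggregate(packets, premium_tokens: int, assured_tokens: int):
    """Police a flow aggregate at a border router.

    packets is a list of (packet, tag) pairs with tag "P", "A" or None.
    Returns (forwarded, dropped). Excess Premium packets are discarded
    rather than queued; excess Assured packets are demoted to best-effort.
    """
    forwarded, dropped = [], []
    for packet, tag in packets:
        if tag == "P":
            if premium_tokens > 0:
                premium_tokens -= 1
                forwarded.append((packet, "P"))
            else:
                dropped.append(packet)  # no buffering at the border
        elif tag == "A" and assured_tokens > 0:
            assured_tokens -= 1
            forwarded.append((packet, "A"))
        else:
            forwarded.append((packet, None))
    return forwarded, dropped


fwd, drop = police_aggregate(
    [("p1", "P"), ("p2", "P"), ("a1", "A"), ("a2", "A"), ("b1", None)],
    premium_tokens=1, assured_tokens=1)
print(fwd, drop)
```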





Fig. 6. Border Router (legend: Assured Service, Premium Service, Best-effort;
Assured and best-effort packets share a RIO queue)



4 Differentiated Services over the GPRS Air Interface 

In this section, we apply the Differentiated Services framework to the existing GPRS 
architecture. Specifically, we will see how the two-bit DiffServ architecture fits in 
GPRS, what changes must be made, and how it will be implemented. 

We will give a simple example in order to make clear the reasons why we want to
apply the Differentiated Services framework in the GPRS environment. Let us
suppose that the GPRS network is attached to an external IP data network that uses
Differentiated Services to provide QoS. The MS sends its IP packets to the GGSN
over the air interface, where they are fragmented into RLC/MAC packets (frames).
When these packets arrive at the GGSN, they are reassembled into IP packets and
forwarded to the external network. Each IP packet is tagged according to the
service that the user wants to receive. Thus, the GGSN acts like the first hop
router in the Internet context, since there is only one IP hop from the MS to the
GGSN, and checks whether the user flow conforms to the existing SLA. The next task
of the GGSN is to forward the packets to the external network, whose nodes treat
the packets as specified by the tag. We can easily conclude that any mobile
user can use the Differentiated Services, as long as the external PDN supports them,
in order to specify the way these packets will be treated in the external network.
However, with the present techniques the mobile user cannot control the way these
packets are treated within the GPRS network. Our purpose is to design such a
mechanism.

Before we proceed to the application of the Two-bit DiffServ architecture in the 
GPRS environment, we must make some assumptions. First, we assume that the core 
GPRS network has sufficient resources for all traffic. In other words, the point of 
congestion is not the GPRS backbone, but the radio link, i.e. the access link that 
connects the MS with the appropriate BSS. This is an important but reasonable 
assumption given that the scarce resource in the GPRS network is the radio spectrum. 
Also, we assume that the size of the frames transferred over the radio link is fixed and 
equal to the size of a GPRS RLC/MAC packet (frame). 

As described in the previous section, the two-bit DiffServ architecture involves two 
types of nodes in a DS domain: the first hop and the border router. In the case of our 
design for GPRS, we decided to have the GPRS network act as an independent DS 
domain. As far as the border router is concerned, it is obvious that the GGSN is the 
most appropriate node for this task. It is the node that connects two DS domains. The 
GGSN monitors the incoming and outgoing flow aggregates in order to check their 
consistency with the SLAs between the two DS domains. Non-conforming traffic 
should be either discarded or degraded, as depicted in Fig. 6. No special changes need 
to be made to the GGSN in order for it to act as a border router since it communicates 
via the IP protocol with both sides (both the SGSN and the border router of the 
neighbor domain). 




Fig. 7. Two-bit Differentiation Architecture in GPRS 

When a PDP context is activated, the user can request a specific QoS level using
the quality parameters mentioned earlier. In this case, the user sets the precedence
parameter equal to one of the three available values. The highest priority makes use
of the Premium Service, the medium priority of the Assured Service and the lowest
priority of the best-effort service. This parameter is used to specify the behavior
that the flow should receive in the GPRS core network and in the external network
(if the latter uses Differentiated Services), and also the default radio priority
used over the radio link.

As for the first hop router, this should be the BSS. Although its tasks will be the
same as those described in Section 3, its structure will be totally different from
the one depicted in Fig. 5. This is due to some differences in architecture between
an IP network and a GPRS network. Taking into account that the
MSs send their data only when the BSS instructs them to and that they use the
timeslot(s) defined by the USF field, we can assume that the traffic conditioner does
not reside on the BSS, but is distributed. The queues are realized in the MS (or in
the notebook connected to the MS) and the tokens come from the BSS. Actually, the
USF values are the tokens transferred over the radio link.

Another important difference in having the BSS as a first hop router is that within 
the BSS there is just an emulation of the system depicted in Fig. 5, as described later 
in this section. Therefore, the BSS only needs a software upgrade in order to act as a 
first hop router, which makes it easier for implementation. No complex data structures 
are required. For queue implementation, linked lists can be used. Timers, counters and 
constants are all that is needed to realize the constant fill rate of the token pools and 
the thresholds of the RIO queues. 

In the system described above, no packets actually circulate, just requests for
transfer. To be more precise, for each packet that the MS wants to transfer over the
air, a pair (MS identity, service class) enters the above system. When the request
exits the system, the BSS instructs the corresponding MS to transfer its packet by
transmitting in a specified timeslot. The service class that an MS desires is
declared with the use of the radio priority field in the TBF establishment request
message. This field is two bits long, yielding four values. We decided to have the
following encoding: “1” for Premium Service, “2” for Assured Service and “3” for
best-effort service. “0” specifies that the priority chosen at the PDP context
activation will be used. The default value of the radio priority field is zero.
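
The encoding can be written down compactly. In the sketch below, the numeric
values follow the encoding given above, while the names `RadioPriority` and
`effective_class` are hypothetical; `pdp_default` stands for the class derived
from the precedence chosen at PDP context activation.

```python
from enum import IntEnum


class RadioPriority(IntEnum):
    """Two-bit radio priority field, using the encoding proposed in the text."""
    USE_PDP_DEFAULT = 0
    PREMIUM = 1
    ASSURED = 2
    BEST_EFFORT = 3


def effective_class(radio_priority: int, pdp_default: RadioPriority) -> RadioPriority:
    """Resolve the service class of a TBF establishment request."""
    requested = RadioPriority(radio_priority)  # ValueError outside 0..3
    if requested is RadioPriority.USE_PDP_DEFAULT:
        if pdp_default is RadioPriority.USE_PDP_DEFAULT:
            raise ValueError("the PDP default must be a real service class")
        return pdp_default
    return requested


print(effective_class(0, RadioPriority.PREMIUM).name)
print(effective_class(2, RadioPriority.PREMIUM).name)
```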

When a pair is inserted into the system, three possible actions may occur: 

• the pair is forwarded to the appropriate output queue, if the counters of the 
Premium or Assured Service’s pools are bigger than zero, or if the priority 
chosen is equal to “3” 

• the pair is inserted into the waiting queue of Premium Service, if the 
corresponding counter is equal to zero, or 

• the pair is forwarded to the corresponding output queue with its priority set to 
“3”, if the Assured Service’s pool counter is equal to zero. 

If the priority chosen is zero, then the corresponding value in the pair inserted into the 
system will not be zero. Instead, the real value from the default PDP context is used. 
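
The three actions listed above can be sketched as a small emulator operating
on (MS identity, service class) pairs. This is an illustrative reading of the
rules, not the authors' implementation: the class name `BssEmulator`, the
returned strings and the use of a single shared queue for Assured and
best-effort requests are assumptions, and a priority of zero is assumed to
have been replaced by the PDP default before insertion.

```python
from collections import deque


class BssEmulator:
    """Emulates the first hop router inside the BSS using transfer requests."""

    def __init__(self, premium_tokens: int, assured_tokens: int):
        self.tokens = {1: premium_tokens, 2: assured_tokens}  # pool counters
        self.premium_wait = deque()  # Premium requests waiting for a token
        self.premium_out = deque()   # output queue for Premium requests
        self.rio_out = deque()       # shared output queue: (ms_id, class)

    def insert(self, ms_id, service_class: int) -> str:
        if service_class not in (1, 2, 3):
            raise ValueError("service class must be 1, 2 or 3")
        if service_class == 3:
            self.rio_out.append((ms_id, 3))
            return "forwarded"
        if self.tokens[service_class] > 0:
            # A token is available: consume it and forward the request.
            self.tokens[service_class] -= 1
            if service_class == 1:
                self.premium_out.append(ms_id)
            else:
                self.rio_out.append((ms_id, 2))
            return "forwarded"
        if service_class == 1:
            # Premium without a token waits for the pool to be refilled.
            self.premium_wait.append(ms_id)
            return "waiting"
        # Assured without a token is forwarded with its priority set to 3.
        self.rio_out.append((ms_id, 3))
        return "demoted"
```

A request leaving `premium_out` or `rio_out` would then trigger the BSS to
grant the corresponding MS a timeslot for one radio block.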

After the transmission of a packet (i.e., after four TDMA frames, since the packet 
is a radio block) the MS must make a new request to the BSS to transfer another 
packet. This makes clear that a TBF lasts for the transmission of only one radio block, 
after which the TBF is terminated and another one must be established to continue the 
transfer. 

The architecture described above provides good results in both directions of the 
radio link. On the downlink, when data enter the GPRS network in order to reach a 
mobile user, the traffic is either characterized with, or translated to, one of the 
available service classes (Premium, Assured, best-effort). This is done at the GGSN. 
If the neighbor PDN does not support Differentiated Services, then the GGSN tags the 
incoming packets according to the profile of the user they are directed to. If, on the
other hand, the neighbor PDN supports Differentiated Services, then the GGSN 
translates the incoming tags according to the SLA between the two DS domains. 
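
The downlink decision at the GGSN amounts to a simple rule, sketched below.
The function name, the string labels for the service classes and the fallback
to best-effort for tags not covered by the SLA are assumptions made for
illustration.

```python
def downlink_class(incoming_tag, user_profile_class, sla_translation=None):
    """Choose the GPRS service class of a downlink packet at the GGSN.

    incoming_tag:       tag carried by the packet (ignored without DiffServ)
    user_profile_class: class from the profile of the destination user
    sla_translation:    mapping from external tags to local classes, or None
                        if the neighbor PDN does not support DiffServ
    """
    if sla_translation is None:
        # No DiffServ in the neighbor PDN: tag according to the user profile.
        return user_profile_class
    # DiffServ in the neighbor PDN: translate according to the inter-domain
    # SLA (assumption: unknown tags fall back to best-effort).
    return sla_translation.get(incoming_tag, "best-effort")


sla = {"P": "premium", "A": "assured", None: "best-effort"}
print(downlink_class("P", "assured"))        # neighbor PDN without DiffServ
print(downlink_class("P", "assured", sla))   # neighbor PDN with DiffServ
```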

On the uplink, the mobile user is able to tag his IP packets, activate a service class 
during PDP context activation or request a service class during the TBF establishment 
phase. The decision of which method to use depends on the user and on the network 
and is discussed in the next section. 



5 Discussion 

In this section we discuss some issues concerning the proposals we made in this
paper. One first issue concerns the transfer rate offered by the Premium Service. If
the GPRS operator defines the Premium Service's constant rate, then
he can calculate how many simultaneous users a BSS can handle, taking into
consideration the number of channels that the BSS serves, the number of timeslots in
each frequency carrier assigned to GPRS traffic, the size of radio blocks and, for
statistical decisions, user profiles. Thus, the operator will be able to perform Call
Admission Control on Premium Service requests, which is required since this type of
service is the only one that offers strict guarantees.
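
A back-of-the-envelope version of this calculation is sketched below. It is a
deterministic simplification: statistical user profiles, signalling overhead
and retransmissions are ignored, the radio block duration is derived from the
timeslot length given in Section 2, and all parameter names and example values
(including the payload per radio block) are hypothetical.

```python
import math

# One radio block spans four TDMA frames of eight 0.5769 ms timeslots each.
RADIO_BLOCK_S = 4 * 8 * 0.5769e-3


def max_premium_users(carriers: int, gprs_slots_per_carrier: int,
                      payload_bits_per_block: int,
                      premium_rate_bps: float) -> int:
    """Upper bound on simultaneous Premium users one BSS can guarantee."""
    pdch_rate_bps = payload_bits_per_block / RADIO_BLOCK_S
    total_rate_bps = carriers * gprs_slots_per_carrier * pdch_rate_bps
    return math.floor(total_rate_bps / premium_rate_bps)


def admit(active_premium_users: int, capacity: int) -> bool:
    """Call Admission Control: accept a new Premium request only if it fits."""
    return active_premium_users < capacity


# Illustrative cell: 2 carriers, 4 GPRS timeslots each, 160 payload bits per
# radio block, and a Premium rate of 8 kbit/s (all values are made up).
capacity = max_premium_users(2, 4, 160, 8000.0)
print(capacity, admit(7, capacity), admit(8, capacity))
```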

A second issue is the length of a TBF, in the case of adapting Differentiated
Services to the GPRS environment. As described in Section 4, the length of a TBF is
set equal to the time to transmit one radio block. This is because the BSS must
receive a request for every packet that is to be transferred on the
uplink. Furthermore, the BSS must know the radio priority of each packet. Since the
radio priority is defined only during the establishment of the TBF, when the MS
requests permission to transfer its data, the result is to limit the duration of a
TBF to the transmission of one radio block. This makes the emulation system easier to
implement and keeps the computational load on the BSS very low. However, it also
results in the use of extra TBFs (and TFIs) for the transfer of packets from
the same MS. On the downlink things are simpler, since the BSS is the one that does
all the scheduling and buffering.

Another important issue is which service class should be assigned to the IP packets
that are reassembled at the GGSN and forwarded to the external network, in the case
where Differentiated Services are also supported by the external PDN. There are
several possibilities. The user's application may use the “Type of Service” or the
“Traffic Class” field of the IP packet to define what service should be used in the
external network. Another solution is to use the default priority class defined at
the PDP context activation phase. The first solution gives the user the ability to
have his packets treated differently inside and outside the GPRS network. The second
solution allows the user to have his packets treated uniformly in both networks. It
is desirable that the user should be able to make the final choice, so the GPRS
network should probably implement both solutions.

One last issue is the charging and pricing of such services. Although it is outside 
the scope of this paper, we should mention that the architecture described here 
enables charging using congestion pricing techniques. A first step in this direction is 
described in [12], where the existing congestion pricing theory is extended to the 
DiffServ environment described here. 
6 Conclusions 

We have presented a way to apply the Differentiated Services framework to the GPRS 
wireless access environment. Our purpose was to enhance the GPRS network with 
QoS support that will be taken into consideration by the radio resource allocation 
procedures. For this purpose, the precedence QoS parameter and the radio-priority 
field were used, in combination with an adapted Two-bit Differentiated Services 
architecture. Note that the wireless access part is expected to be the most congested 
part of the GPRS network because of the scarcity of the wireless spectrum and 
therefore the part of the system where QoS support is most critical. At the same time, 
dynamic charging techniques can be combined with the service differentiation in 
order to make the resource allocation decisions efficient. 

With the proposed architecture, GPRS operators will be able to provide end-to-end 
service differentiation fully compatible with the rest of the Internet and in cooperation 
with content providers. Mobile users will be able to select what service they want to 
be used for the transfer of their data and they will be charged accordingly. Even if the 
external networks do not provide service differentiation, GPRS operators will manage 
to offer a first level of differentiation to the wireless access network that they own. 



Acknowledgments 

This research was supported by the European Union's Fifth Framework Project M3I 
(Market-Managed Multiservice Internet - RTD No IST-1999-11429). 



References 

1. R. Kalden, I. Meirick and M. Meyer, "Wireless Internet Access Based on GPRS," IEEE 
Personal Communications, vol. 7, no. 2, pp. 8-18, April 2000. 

2. C. Bettstetter, H.-J. Vogel, and J. Eberspacher, "GSM Phase 2+, General Packet Radio 
Service GPRS: Architecture, Protocols and Air Interface," IEEE Communications 
Surveys, vol. 2, no. 3, 1999 (http://www.comsoc.org/pubs/surveys/). 

3. GSM 02.60: “Digital cellular telecommunications system (Phase 2+); General Packet 
Radio Service (GPRS); Service Description; Stage 1” 

4. GSM 03.60: “Digital cellular telecommunications system (Phase 2+); General Packet 
Radio Service (GPRS); Service Description; Stage 2” 

5. GSM 04.60: “Digital cellular telecommunications system (Phase 2+); General Packet 
Radio Service (GPRS); Mobile Station (MS) - Base Station (BSS) Interface; Radio Link 
Control/Medium Access Control (RLC/MAC) protocol.” 

6. P.E. Chimento, "Tutorial on QoS support for IP," CTIT Technical Report 23, 1998. 

7. F. Baumgartner, T. Braun, P. Habegger, "Differentiated Services: A new approach for 
Quality of Service in the Internet," Proceedings of Eighth International Conference on 
High Performance Networking, Vienna, Austria, 21-25 Sept. 1998. Edited by: Van As, 
H.R., Norwell, MA, USA: Kluwer Academic Publishers, 1998. p. 255-73. 

8. J. Heinanen, F. Baker, W. Weiss, J. Wroclawski, "Assured Forwarding PHB Group," RFC 
2597, June 1999. 

9. V. Jacobson, K. Nichols, K. Poduri, "An Expedited Forwarding PHB," RFC 2598, 
February 1999. 







10. K. Nichols, V. Jacobson, L. Zhang, "A Two-bit Differentiated Services Architecture for 
the Internet," RFC 2638, July 1999. 

11. G. Priggouris, S. Hadjiefthymiades, L. Merakos, "Supporting IP QoS in the General 
Packet Radio Service," IEEE Network, vol. 14, no. 5, pp. 8-17, Sept.-Oct. 2000. 

12. S. Soursos, "Enhancing the GPRS Environment with Differentiated Services and 
Applying Congestion Pricing," M.Sc. thesis, Dept. of Informatics, Athens University of 
Economics and Business, February 2001. 




Wireless Access to Internet via IEEE 802.11: 
An Optimal Control Strategy for Minimizing 
the Energy Consumption 



R. Bruno, M. Conti, and E. Gregori 

Consiglio Nazionale delle Ricerche 
Istituto CNUCE 

Via G. Moruzzi, 1, 56124 Pisa - Italy 
Tel: (050) 315 3062, Fax: (050) 3138091, 

{Marco.Conti, Enrico.Gregori}@cnuce.cnr.it 
Raffaele.Bruno@guest.cnuce.cnr.it 



Abstract. The IEEE 802.11 standard is the most mature technology to provide 
wireless connectivity for fixed, portable and moving stations within a local area. 
Wireless communications and the mobile nature of devices involved in con- 
structing WLANs generate new research issues compared with wired networks: 
dynamic topologies, limited bandwidth, energy-constrained operations, noisy 
channel. In this paper, we deal with the issue of minimizing the energy con- 
sumed by each station to perform a successful transmission. Specifically, by 
exploiting analytical formulas for the energy consumption, we derive the theo- 
retical lower bound for the energy consumed to successfully transmit a mes- 
sage. This knowledge allows us to define a novel transmission control strategy 
based on simple and low-cost energy consumption estimates that permits each 
station to optimize at run-time its power utilization. Our strategy is completely 
distributed and it does not require any information on the number of stations in 
the network. Simulation results prove the effectiveness of our transmission con- 
trol strategy. Specifically, the IEEE 802.11 extended with our algorithm ap- 
proaches the theoretical lower bound for the energy consumption in all the con- 
figurations analyzed. 

Keywords: power saving, Wireless LAN (WLAN), MAC protocol, IEEE 
802.11, analytical modeling, performance analysis 



1. Introduction 

In the near future we will witness rapid growth in the need for mobile and 
ubiquitous connection to Internet information services. Hence, wireless 
technologies will become more and more widely used as a means to access the 
Internet. In this paper we focus on Wireless Local Area Network (WLAN) 
technologies, which are designed to provide wireless connectivity for fixed, 
portable and moving stations within a local area. A key success factor of WLANs is 
the availability of global standards for developing networking products that can 
provide wireless network access at a competitive price. In this sense, the most 
mature technology is the one defined by the IEEE 802.11 standard [8], which follows 
the Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA) paradigm. 

Two different approaches can be followed in the implementation of a WLAN: an 
infrastructure-based approach, or an ad hoc networking one [10]. An infrastructure- 
based architecture imposes the existence of a centralized controller for each cell, often 
referred to as Access Point. The Access Point is normally connected to the wired 
network thus providing the Internet access to mobile devices. In contrast, an ad hoc 
network is a peer-to-peer network formed by a set of stations within the range of each 
other that dynamically configure themselves to set up a temporary network. The IEEE 
802.11 standard can be used to implement both wireless infrastructure networks and 
wireless ad hoc networks. The IEEE 802.11 WLAN is a single-hop ad hoc network. 
However, it is also emerging as one of the most promising technologies for 
constructing multi-hop mobile ad hoc networks [6]. 

Wireless communications and the mobile nature of devices involved in construct- 
ing WLANs generate new research issues compared with wired networks: dynamic 
topologies, limited bandwidth, energy-constrained operations, noisy channel. In 
WLANs, the medium access control (MAC) protocol is the main element that deter- 
mines the efficiency of the resource utilization, since it performs the coordination of 
transmissions of the network stations and manages the congestion situations that may 
occur inside the network. The congestion level in the network negatively affects both 
the link utilization, i.e., the fraction of channel bandwidth used by successfully 
transmitted messages, and the energy consumed to successfully transmit a message. 
Specifically, each collision removes a considerable amount of channel bandwidth 
from that available for successful transmissions. In the same way, each collision 
represents a significant waste of energy, since transmitting data is one of the most 
power-consuming activities a station performs. To reduce the collision probability, 
the IEEE 802.11 protocol uses a set of slotted windows for the backoff, whose size 
doubles after each collision. However, the time spreading of the accesses that the 
standard backoff procedure accomplishes can have a negative impact on both the link 
utilization and the energy consumption. Specifically, the time spreading of the 
accesses can introduce large delays in the message transmissions and additional 
energy wastage due to the carrier sensing. Furthermore, the time spreading is obtained 
at the cost of second collisions. Previous works have shown that an appropriate tuning 
of the IEEE 802.11 backoff algorithm can significantly increase the protocol capacity 
[4], [3]. Specifically, in [3], by exploiting exact analytical formulas for the link 
utilization, we define a backoff tuning algorithm based on simple and low-cost load 
estimates (obtained from the information provided by the carrier sensing mechanism, 
i.e., by observing idle slots, collisions and successful transmissions) that enables 
each station to estimate at run-time the average size of the backoff window that 
achieves the theoretical upper bound for the link utilization. 

In this paper, we deal with the issue of minimizing the energy consumed by each 
station to perform a successful transmission. Specifically, this work is based on the 
approach followed in [3], but we extend it by adding power-saving features to our 
backoff tuning algorithm. The impact of network technologies and interfaces on the 
energy consumption has been investigated in depth in [1], [7]. The power-saving 
features of the emerging standards for WLANs have been analyzed in [11], [5]. 
Distributed strategies for power saving (i.e., independently executed by each station, 
hence fitting the ad hoc networking paradigm) have been proposed and investigated in 
[2], [9]. Specifically, in [9] the authors propose a power-controlled wireless MAC 
protocol based on a fine-tuning of the network interface transmitting power. In [2] the 
authors propose a mechanism that dynamically adapts, by estimating the slot utilization 
and the average message length, the time spreading of accesses to asymptotically 
approach the minimum energy consumption for a large network population. As in [2], 
we exploit exact analytical formulas for the energy consumption to derive the 
theoretical lower bound for the energy consumed to successfully transmit a message. The 
knowledge of the station behavior that minimizes the energy consumption allows us to 
define a novel transmission control strategy based on simple and low-cost energy 
consumption estimates that permits each station to optimize its power utilization at 
run-time. Our strategy does not require any information about the network population, 
yet it adapts to the number of stations in the network. Simulation results show that 
our transmission control strategy approaches the theoretical lower bound for the energy 
consumption in an IEEE 802.11 network in all the configurations analyzed. 

The rest of the paper is organized as follows. Section 2 describes the protocol de- 
tails of the wireless MAC protocol considered. Section 3 discusses the analytical 
model for the energy consumption required to successfully transmit a MAC frame. 
Section 4 defines the Power Saving Simple Dynamic 802.11 Protocol (PS-SDP). In 
Section 5, PS-SDP 802.11 is compared to standard IEEE 802.11 protocol. Finally, 
Section 6 summarizes key results. 



2. Description of the p-persistent IEEE 802.11 Protocol 

The IEEE 802.11 MAC protocol provides asynchronous, time-bounded and 
contention-free access control on a variety of physical layers. The basic access method 
in the IEEE 802.11 MAC protocol is the Distributed Coordination Function (DCF), which 
is a Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA) MAC protocol. 
An exhaustive description of the DCF MAC protocol features can be found in [8]. 
In [4] it has been shown that the performance of the standard protocol can be derived 
by analyzing the corresponding p-persistent IEEE 802.11 protocol. The p-persistent 
IEEE 802.11 protocol differs from the standard protocol only in the selection of the 
backoff interval: at the beginning of an empty slot a station transmits (in that slot) 
with probability p, while it defers the transmission with probability 1 - p, and then 
repeats the procedure at the next empty slot. Hence, in this protocol the average 
backoff time is completely identified by the p value. It is worth remembering that 
choosing a p value is equivalent to identifying, in the standard protocol, the average 
backoff window size [4]. This means that the procedure analyzed in this paper to tune 
the p-persistent IEEE 802.11 protocol by observing the network status can be exploited 
in an IEEE 802.11 network to select, for a given congestion level, the appropriate size 
of the contention window. Given the equivalence between the standard IEEE 802.11 
protocol and the p-persistent IEEE 802.11 protocol, we will provide an analytical 
model of the p-persistent protocol. 
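
The backoff rule just described can be sketched in a few lines of Python. This is only an illustration of the per-slot transmit/defer decision, not the authors' simulator: the function names and the seed are assumptions made for the example. Since each empty slot is an independent trial with success probability p, the number of deferred slots is geometric with mean (1 - p)/p.

```python
import random


def sample_backoff_slots(p: float, rng: random.Random) -> int:
    """Count the empty slots a station defers before it transmits.

    At every empty slot the station transmits with probability p and
    defers with probability 1 - p, as in the p-persistent protocol.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    deferred = 0
    while rng.random() >= p:  # defer with probability 1 - p
        deferred += 1
    return deferred


def mean_backoff_slots(p: float) -> float:
    """Mean of the geometric backoff: (1 - p) / p empty slots."""
    return (1.0 - p) / p


if __name__ == "__main__":
    rng = random.Random(2001)  # fixed seed, chosen arbitrarily
    p = 0.1
    samples = [sample_backoff_slots(p, rng) for _ in range(20000)]
    print("empirical mean :", sum(samples) / len(samples))
    print("analytical mean:", mean_backoff_slots(p))
```

A single parameter p therefore fixes the average backoff time, which is the property the analysis relies on.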

In the rest of this section we detail the behavior of the p-persistent IEEE 802.11 
protocol. Before a station initiates a transmission, it senses the channel to determine 
whether another station is transmitting. If the medium is found to be idle for an 
interval that exceeds the Distributed InterFrame Space (DIFS), the station continues 
with its transmission. Otherwise (i.e., if the medium is busy), the transmission is 
deferred until the end of the ongoing transmission. The idle time immediately following 
an idle DIFS after the transmission completion is slotted, and a station is allowed to 
transmit only at the beginning of each slot time. Hereafter, we will refer to the slot 
time duration as $t_{slot}$. The $t_{slot}$ is equal to the time needed at any station 
to detect the transmission of a packet from any other station. The decision to begin a 
transmission or to defer it to the next empty slot is made according to the 
transmission probability p. Immediate positive acknowledgements are employed to 
ascertain the successful reception of each packet transmission¹. This is accomplished 
by the receiver, which (immediately following the reception of the data frame) 
initiates the transmission of an acknowledgement frame (ACK) after a time interval, the 
Short InterFrame Space (SIFS), which is less than the DIFS. If an acknowledgement is 
not received, the data frame is presumed to have been lost and a retransmission is 
scheduled. 

The model used in this paper to evaluate the performance figures does not depend 
on the technology adopted at the physical layer (e.g., infrared or spread spectrum). 
However, the physical layer technology determines some network parameter values, 
e.g., SIFS, DIFS and $t_{slot}$. In Table 1 we report the parameter setting we will 
adopt in all the numerical evaluations and simulation runs performed in this paper. 
The values chosen for the technology-dependent parameters comply with the 
frequency-hopping spread-spectrum technology at a 2 Mbit/s transmission rate [8]. 



Table 1. Parameter setting for the p-persistent IEEE 802.11 protocol. 

t_slot     Propagation delay (τ)   DIFS          SIFS          ACK        Bit rate 
50 μsec    <1 μsec                 2.56 t_slot   0.56 t_slot   112 bits   2 Mbps 
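
Since the later formulas work in units of the slot time, it can help to convert the Table 1 settings once. The sketch below is a convenience only, not part of the original model; it assumes that the DIFS and SIFS entries are multiples of the slot time (2.56 and 0.56) and that the propagation delay takes its 1 μsec upper bound. The class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhyParameters:
    """Table 1 settings; the field names are illustrative assumptions."""
    slot_us: float = 50.0          # t_slot in microseconds
    propagation_us: float = 1.0    # upper bound on the propagation delay
    difs_slots: float = 2.56       # DIFS, assumed to be in units of t_slot
    sifs_slots: float = 0.56       # SIFS, assumed to be in units of t_slot
    ack_bits: int = 112
    bit_rate_mbps: float = 2.0     # 1 Mbit/s equals 1 bit per microsecond

    def ack_us(self) -> float:
        """Time to send the ACK frame, in microseconds."""
        return self.ack_bits / self.bit_rate_mbps

    def in_slots(self, duration_us: float) -> float:
        """Convert a duration in microseconds to units of t_slot."""
        return duration_us / self.slot_us

    def in_microseconds(self, slots: float) -> float:
        """Convert a duration in units of t_slot to microseconds."""
        return slots * self.slot_us


if __name__ == "__main__":
    phy = PhyParameters()
    print("DIFS [us]   :", phy.in_microseconds(phy.difs_slots))
    print("SIFS [us]   :", phy.in_microseconds(phy.sifs_slots))
    print("ACK  [us]   :", phy.ack_us())
    print("ACK  [slots]:", phy.in_slots(phy.ack_us()))
    print("tau  [slots]:", phy.in_slots(phy.propagation_us))
```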



3. Power Consumption Analysis 

The analysis of the energy consumption in a p-persistent IEEE 802.11 network has 
been performed in [2]. Specifically, in that paper, the authors focus their attention on 
a tagged station and observe the system at the end of each successful transmission 
attempt of the tagged station. By assuming that the message lengths are i.i.d. random 
variables, and considering the p-persistent protocol behavior, it follows that all the 
processes that characterize the occupancy pattern of the channel (i.e., idle periods, 
collisions and successful transmissions) are regenerative with respect to the time 
instants corresponding to the completion of the tagged-station successful 
transmissions². Therefore, the energy consumption analysis can be performed by studying 
the system behavior in a generic renewal period, also referred to as virtual 
transmission time. By assuming that PTX and PRX are the power consumptions (expressed 
in mW) of the network interface during the transmitting and receiving phase, 
respectively, the average energy (in mJ) required by a station to perform a successful 
transmission can be expressed as follows [2]: 

$$E[Energy] = E[N_c + 1] \cdot E[N_{not\_used\_slot}] \cdot E[Energy_{not\_used\_slot}] + E[N_c] \cdot E[Energy_{tagged\_collision}] + E[Energy_{tagged\_success}] \qquad (1)$$

where $E[N_c]$ is the average number of collisions experienced by the tagged station 
between two successful transmissions, $E[N_{not\_used\_slot}]$ is the average number of 
not_used slots (from the tagged station standpoint) before a transmission attempt of 
the tagged station, $E[Energy_{not\_used\_slot}]$ is the average energy consumption 
during not_used slots, $E[Energy_{tagged\_collision}]$ is the average energy consumption 
during collisions, conditioned on observing a tagged-station collision, and 
$E[Energy_{tagged\_success}]$ is the average energy consumption during tagged-station 
successful transmissions. 

¹ Recall that CSMA/CA does not rely on the capability of the stations to detect a 
collision by hearing their own transmission. 

² Hereafter, we will assume that successive transmission attempts of a station have 
independent lengths. 

To correctly evaluate the energy consumed during both a collision and a successful 
transmission, we have to take into account the protocol overheads introduced. Let us 
denote with Collision the average length of a collision (not involving the tagged 
station), and with S the average length of a successful transmission, including their 
overheads. Hence, according to the protocol behavior described in Section 2 and 
considering a geometric distribution with parameter q for the message length (expressed 
as a number of $t_{slot}$), it follows (see [4] for the proofs): 

$$S \le 2\tau + SIFS + ACK + DIFS + \frac{1}{1-q} \qquad (2.a)$$



$$Collision \le \tau + DIFS + [\text{summation term, see [4]}] \qquad (2.b)$$



The unknown quantities in Equation (1) are given by the following Lemma (see [2] 
for the proofs). 



Lemma 1. In a network with M stations, by assuming that each station is operating 
in asymptotic conditions (i.e., stations always have at least one packet waiting to be 
transmitted), by denoting with p the transmission probability adopted by each station, 
and with q the parameter of the geometric distribution that defines the message 
length (expressed as a number of $t_{slot}$), it follows: 

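$$E[N_c] = \frac{1 - (1-p)^{M-1}}{(1-p)^{M-1}}, \qquad E[N_{not\_used\_slot}] = \frac{1-p}{p} \qquad (3)$$

$$E[Energy_{not\_used\_slot}] = PRX \cdot \left[ t_{slot} (1-p)^{M-1} + S \cdot (M-1) p (1-p)^{M-2} + Collision \cdot \left( 1 - (1-p)^{M-1} - (M-1) p (1-p)^{M-2} \right) \right] \qquad (4)$$

$$E[Energy_{tagged\_collision}] = PTX \cdot \frac{1}{1-q} + PRX \cdot \left[ \ldots + DIFS + \tau \right] \qquad (5)$$

$$E[Energy_{tagged\_success}] = PTX \cdot \frac{1}{1-q} + PRX \cdot (2\tau + SIFS + ACK + DIFS) \qquad (6)$$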
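
The closed forms of Lemma 1 that can be read unambiguously translate directly into code. The sketch below covers the two counts of Equation (3), the not_used slot energy of Equation (4) and the tagged-success energy of Equation (6). It is an illustration only: the function and argument names are assumptions, all times are taken in units of $t_{slot}$, and the average lengths S and Collision are passed in as inputs rather than computed.

```python
def expected_collisions(p: float, m_stations: int) -> float:
    """E[N_c], Equation (3): tagged-station collisions per tagged success."""
    p_others_silent = (1.0 - p) ** (m_stations - 1)
    return (1.0 - p_others_silent) / p_others_silent


def expected_not_used_slots(p: float) -> float:
    """E[N_not_used_slot], Equation (3): slots the tagged station defers."""
    return (1.0 - p) / p


def energy_not_used_slot(p: float, m_stations: int, success_len: float,
                         collision_len: float, prx: float,
                         t_slot: float = 1.0) -> float:
    """Equation (4): energy spent receiving during one not_used slot.

    The slot is idle, or carries a success or a collision among the
    other M - 1 stations; the tagged station only listens.
    """
    others = m_stations - 1
    p_idle = (1.0 - p) ** others
    p_success = others * p * (1.0 - p) ** (others - 1)
    p_collision = 1.0 - p_idle - p_success
    return prx * (t_slot * p_idle
                  + success_len * p_success
                  + collision_len * p_collision)


def energy_tagged_success(q: float, ptx: float, prx: float, tau: float,
                          sifs: float, ack: float, difs: float) -> float:
    """Equation (6): transmit the message, then listen through the overheads."""
    mean_message_len = 1.0 / (1.0 - q)  # mean of the geometric length
    return ptx * mean_message_len + prx * (2.0 * tau + sifs + ack + difs)


if __name__ == "__main__":
    p, m_stations = 0.05, 10
    print("E[N_c]              :", expected_collisions(p, m_stations))
    print("E[N_not_used_slot]  :", expected_not_used_slots(p))
    print("E[tagged success]   :",
          energy_tagged_success(0.5, 2.0, 1.0, 0.02, 0.56, 1.12, 2.56))
```

Both expressions in Equation (3) are means of geometric counts: a tagged transmission succeeds only when the other M - 1 stations stay silent, and each slot is used by the tagged station with probability p.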

By assuming that all the times are expressed in $t_{slot}$ units, and that PTX and PRX 
are also expressed in mW/$t_{slot}$, then E[Energy] (in mJ) is a function 
f(p, q, M, PTX, PRX). Hereafter, we refer to $p_{min}$ as the p value that minimizes the 
energy consumption for fixed M, q, PTX and PRX. We also say that a network is in its 
optimal operating state if each station adopts the $p_{min}$ value as its transmission 
probability. The $p_{min}$ values for several settings of these parameters can be 
computed by numerically minimizing Equation (1). 



[Figure 1: channel occupancy over one virtual transmission time. Legend: Tagged Success; Not tagged Success; Tagged/Not tagged Collision; DIFS.] 

Fig. 1. Channel structure during two successful transmissions of the tagged station. 



The optimization of the station's power utilization requires the knowledge of the 
$p_{min}$ value. However, minimizing Equation (1) at run time is computationally 
expensive. Furthermore, the minimization of Equation (1) requires a robust estimate of 
the M parameter, i.e., the number of stations in the network, but retrieving this 
information from the channel status in a CSMA-like protocol is extremely complex and 
unreliable. Hence, it would be convenient to find a simpler relationship that provides 
an approximation of the $p_{min}$ value. To this end, in the following we investigate 
the role of the various terms of Equation (1) in determining the overall energy 
consumption. Specifically, we separate the energy consumptions that are increasing 
functions of the p value from those that are decreasing functions of the p value. 
In Figure 1 we have plotted the sequence of idle periods, collisions and successful 
transmissions that occur on the channel between two consecutive tagged-station 
successful transmissions. During the virtual transmission time, we observe on the 
channel an average number of not tagged-station successful transmissions equal to 
M - 1, due to the symmetry of the system. Specifically, no station is privileged in 
accessing the channel, hence all the stations have the same probability of experiencing 
a successful transmission. Therefore, the successful transmissions, from an energy 
consumption standpoint, have a cost that does not vary with the p value, so we cannot 
minimize the energy consumed by a station during successful transmissions, but only 
the energy consumed by a station during idle periods and collisions. Specifically, 
when the channel is not occupied by successful transmissions (i.e., the white fraction 
of the channel, as shown in Figure 1) we have a sequence of tagged-station and not 
tagged-station collisions that follow idle periods. Hereafter, we refer to the average 
energy consumed by a station between two consecutive transmission attempts as 
$E[Energy_{transmission\_attempt}]$. Since all the processes that characterize the 
occupancy pattern of the channel are still regenerative between two transmission 
attempts, we derive the closed formula for $E[Energy_{transmission\_attempt}]$ by using 
the regenerative property (see Appendix A for the closed formulas): 

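$$E[Energy_{transmission\_attempt}] = E[Energy_{idle\_p}] + E[Energy_{success}] \cdot \Pr\{success \mid N_{tr} \ge 1\} + E[Energy_{tagged\_collision}] \cdot \Pr\{tagged\_collision \mid N_{tr} \ge 1\} + E[Energy_{not\_tagged\_collision}] \cdot \Pr\{not\_tagged\_collision \mid N_{tr} \ge 1\} \qquad (7)$$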



where $E[Energy_{idle\_p}]$ is the average energy consumed by a generic station 
listening to the channel; the second term in the right-hand side is the average energy 
consumed during a generic success, conditioned on the observation of a transmission 
attempt, say $E[Energy_{success|transm}]$³; the third term in the right-hand side is 
the average energy consumed during tagged-station collisions, conditioned on the 
observation of a transmission attempt, say $E[Energy_{tagged\_collision|transm}]$; and 
the fourth term in the right-hand side is the average energy consumed during not 
tagged-station collisions, conditioned on the observation of a transmission attempt, 
say $E[Energy_{not\_tagged\_collision|transm}]$. 

³ Since we will not take into account the energy consumed during the successful 
transmissions, in Appendix A we do not report the formula for $E[Energy_{success}]$. 

It is straightforward to derive that $E[Energy_{idle\_p}]$ is a decreasing function of 
the p value, whereas $E[Energy_{tagged\_collision|transm}]$ and 
$E[Energy_{not\_tagged\_collision|transm}]$ are increasing functions of the p value. In 
[3] and [4], we have shown that the p value that maximizes the throughput is well 
approximated by the p value that makes the average time spent sensing the channel equal 
to the average time the channel is occupied by collisions. Following the same approach, 
we suggest that the minimum energy consumption is achieved when each station adopts a 
transmission probability that makes the average energy spent sensing the channel equal 
to the average energy wasted during collisions. We can express this condition with the 
following relationship: 

$$E[Energy_{idle\_p}] = E[Energy_{tagged\_collision|transm}] + E[Energy_{not\_tagged\_collision|transm}] \qquad (8)$$

The right-hand side in Equation (8) represents the average energy consumption during 
a generic collision, given a transmission attempt, i.e., $E[Energy_{collision|transm}]$. 
To compute $E[Energy_{collision|transm}]$ more easily, we further expand the 
$\Pr\{tagged\_collision \mid N_{tr} \ge 1\}$ and $\Pr\{not\_tagged\_collision \mid N_{tr} \ge 1\}$ 
expressions by conditioning on the tagged-station transmission, say tag_tr, and the 
not tagged-station transmission, say not_tag_tr, respectively. Equation (8) can be 
written as (see Appendix A for the closed formulas): 

$$E[Energy_{idle\_p}] = E[Energy_{tagged\_collision}] \cdot \Pr\{tagged\_collision \mid tag\_tr\} \cdot \Pr\{tag\_tr \mid N_{tr} \ge 1\} + E[Energy_{not\_tagged\_collision}] \cdot \Pr\{not\_tagged\_collision \mid not\_tag\_tr\} \cdot \Pr\{not\_tag\_tr \mid N_{tr} \ge 1\} \qquad (9)$$

As previously said the approach followed to derive Equation (9) is similar to the one 
followed in [4], [3] to maximize the throughput. Therefore, afterwards we will pro- 
vide further considerations on the relationship between the energy-consumption 
minimization problem and the throughput maximization problem. 
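
The balance condition lends itself to a simple numerical illustration: one side decreases with p, the other increases, so their crossing can be located by bisection. The sketch below is not the Appendix A derivation. It assumes simple closed forms for the idle energy and for the conditional collision probabilities, derived here from independent per-slot transmissions with probability p, and it treats the two collision energies as constants supplied by the caller. All function names are illustrative.

```python
def idle_energy(p: float, m_stations: int, prx: float = 1.0,
                t_slot: float = 1.0) -> float:
    """Energy spent listening to the idle slots preceding an attempt.

    Assumed form: the number of idle slots before a slot with at least
    one transmission is geometric, with mean (1-p)^M / (1 - (1-p)^M).
    """
    p_all_silent = (1.0 - p) ** m_stations
    return prx * t_slot * p_all_silent / (1.0 - p_all_silent)


def collision_energy(p: float, m_stations: int, e_tagged: float,
                     e_not_tagged: float) -> float:
    """Average collision energy given a transmission attempt (assumed form)."""
    others = m_stations - 1
    p_attempt = 1.0 - (1.0 - p) ** m_stations
    p_others_silent = (1.0 - p) ** others
    p_one_other = others * p * (1.0 - p) ** (others - 1)
    # Tagged station transmits and at least one other station does too.
    p_tagged_coll = p * (1.0 - p_others_silent)
    # Tagged station is silent and at least two other stations transmit.
    p_not_tagged_coll = (1.0 - p) * (1.0 - p_others_silent - p_one_other)
    return (e_tagged * p_tagged_coll
            + e_not_tagged * p_not_tagged_coll) / p_attempt


def balance_point(m_stations: int, e_tagged: float, e_not_tagged: float,
                  lo: float = 1e-6, hi: float = 0.5, iters: int = 60) -> float:
    """Bisection on idle_energy(p) - collision_energy(p) over [lo, hi]."""
    def gap(p: float) -> float:
        return (idle_energy(p, m_stations)
                - collision_energy(p, m_stations, e_tagged, e_not_tagged))

    if gap(lo) <= 0.0 or gap(hi) >= 0.0:
        raise ValueError("bracket does not contain a sign change")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            lo = mid  # idle energy still dominates: increase p
        else:
            hi = mid  # collision energy dominates: decrease p
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Collision energies taken from the "Approx. Value" columns of
    # Tables 4 and 5 for m = 2 and PTX/PRX = 2.
    for m_stations in (10, 100):
        print(m_stations, balance_point(m_stations, 7.246667, 5.246667))
```

The printed values can be compared with the approximated p values reported in Table 2 for m = 2; any difference reflects the simplifying assumptions made in this sketch.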



Table 2. Energy consumption analysis (PTX = 2, PRX = 1). 

              p value                                  Minimum Energy Consumption 
              Exact Value       Approximated Value     Exact Value        Approximated Value 
m (t_slot)    M=10     M=100    M=10     M=100         M=10     M=100     M=10     M=100 
2             5.08e-2  5.10e-3  5.43e-2  5.54e-3       99.4944  987.412   99.6043  988.999 
5             3.84e-2  3.84e-3  4.06e-2  4.14e-3       144.895  1411.79   144.988  1413.14 
10            2.98e-2  3.01e-3  3.11e-2  3.17e-3       214.851  2063.92   214.928  2065.03 
20            2.24e-2  2.26e-3  2.32e-2  2.36e-3       346.766  3290.21   346.828  3291.11 
50            1.49e-2  1.51e-3  1.53e-2  1.55e-3       721.198  6759.72   721.244  6760.38 
100           1.08e-2  1.09e-3  1.10e-2  1.12e-3       1321.73  12310.1   1321.77  12310.6 



In Tables 2 and 3 we compare the minimum energy consumption, calculated by 
minimizing Equation (1), with the energy consumption measured when each station 
adopts the p value that satisfies Equation (9). In Tables 2 and 3 we also compare the 
p value that satisfies Equation (9) with the $p_{min}$ value. The numerical results are 
obtained by assuming PRX = 1 and considering either PTX = 2 or PTX = 10 [7]. Fixing 
PRX = 1 avoids unnecessary detail: our energy unit is the energy consumed in one 
$t_{slot}$ when the network interface is in the receiving state. 



Table 3. Energy consumption analysis (PTX = 10, PRX = 1). 

              p value                                  Minimum Energy Consumption 
              Exact Value       Approximated Value     Exact Value        Approximated Value 
m (t_slot)    M=10     M=100    M=10     M=100         M=10     M=100     M=10     M=100 
2             4.15e-2  4.96e-3  4.42e-2  5.38e-3       123.874  1013.78   123.987  1015.39 
5             2.99e-2  3.74e-3  3.14e-2  3.99e-3       199.335  1470.15   199.425  1471.51 
10            2.25e-2  2.88e-3  2.34e-2  3.04e-3       315.967  2171.08   316.039  2172.20 
20            1.66e-2  2.16e-3  1.71e-2  2.26e-3       537.148  3489.52   537.204  3490.41 
50            1.09e-2  1.44e-3  1.11e-2  1.48e-3       1169.70  7222.64   1169.74  7223.30 
100           7.88e-3  1.04e-3  8.00e-3  1.06e-3       2190.53  13199.3   2190.56  13199.9 



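The results reported in Tables 2 and 3 show that Equation (8) gives a good 
approximation of the minimum energy consumption for all the network parameter 
settings we have considered. It is worth pointing out that, according to the numerical 
results, a precise approximation of the minimum energy consumption does not require 
the same level of accuracy in the approximation of the $p_{min}$ value. For example, in 
Table 2 with m = 2 and M = 100 the approximated p value exceeds the exact one by about 
9%, while the corresponding energy consumption differs by less than 0.2%. 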

From the discussion so far, it follows that the energy consumption can be minimized 
not only by computing the p value that minimizes Equation (1): it is sufficient to 
identify the p value that makes $E[Energy_{idle\_p}]$ equal to 
$E[Energy_{collision|transm}]$. By exploiting the latter, we are able to define an 
efficient and simple transmission control strategy (see Section 4). In the remainder of 
this section, we give an insight into some of the most relevant characteristics of the 
network behavior when each station operates so as to minimize the energy consumption. 
This analysis is significant not only to investigate the network characteristics on 
which our transmission control strategy is based, but also to point out how the 
energy-consumption minimization issue is related to (and differs from) the throughput 
maximization issue. 

First of all, we give some approximations for the complex expressions of 
$E[Energy_{tagged\_collision}]$ and $E[Energy_{not\_tagged\_collision}]$ when the network 
is in its optimal operating state. Specifically, we can assume that the collision 
probability is low when the network is in its optimal operating state. Hence, the 
approximation is obtained by assuming that no more than two stations collide. According 
to this assumption, for a geometric message-length distribution, it follows: 



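$$E[Energy_{not\_tagged\_collision}] \approx PRX \cdot \left[ \frac{1 + 2q - q^2 - 2q^3}{(1-q)^2 (1+q)^2} + DIFS + \tau \right] \qquad (10.a)$$

$$E[Energy_{tagged\_collision}] \approx \frac{1}{1-q} \left[ PTX + PRX \cdot \frac{q}{1+q} \right] + PRX \cdot (DIFS + \tau) \qquad (10.b)$$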
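
The two-station approximations are easy to evaluate numerically. The sketch below encodes them as read here: the not tagged-station collision costs PRX times the expected length of the longer of two geometric messages plus DIFS and the propagation delay, and the tagged-station collision costs the station's own transmission plus reception of the remainder. The function names are illustrative, and the settings DIFS = 2.56 and τ = 0.02 (in units of $t_{slot}$) are assumptions based on Table 1.

```python
def q_from_mean_length(m: float) -> float:
    """Geometric parameter q for an average message length of m slots."""
    return 1.0 - 1.0 / m


def approx_not_tagged_collision(q: float, prx: float, difs: float,
                                tau: float) -> float:
    """Two-station approximation for a collision the station only hears."""
    # Expected length of the longer of two geometric messages.
    longer_of_two = ((1 + 2 * q - q ** 2 - 2 * q ** 3)
                     / ((1 - q) ** 2 * (1 + q) ** 2))
    return prx * (longer_of_two + difs + tau)


def approx_tagged_collision(q: float, ptx: float, prx: float, difs: float,
                            tau: float) -> float:
    """Two-station approximation for a collision the station takes part in."""
    mean_length = 1.0 / (1.0 - q)
    return mean_length * (ptx + prx * q / (1.0 + q)) + prx * (difs + tau)


if __name__ == "__main__":
    difs, tau = 2.56, 0.02  # assumed: Table 1 values in units of t_slot
    for m in (2, 10, 100):
        q = q_from_mean_length(m)
        print(m,
              approx_tagged_collision(q, 2.0, 1.0, difs, tau),
              approx_tagged_collision(q, 10.0, 1.0, difs, tau),
              approx_not_tagged_collision(q, 1.0, difs, tau))
```

Under these assumed settings the results for m = 2 agree with the "Approx. Value" entries 7.246667 and 23.24667 of Table 4 and 5.246667 of Table 5.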
In Tables 4 and 5 we compare the exact values of the average energy consumed 
during tagged-station collisions, given by relationship (5), and during not 
tagged-station collisions, given by relationship (A.2), with the approximations 
provided by (10.b) and (10.a), respectively, when all the stations adopt the 
$p_{min}$ value as their transmission probability. 



Table 4. $E[Energy_{tagged\_collision}]$ 

              PTX/PRX = 2                          PTX/PRX = 10 
              Exact Value          Approx.         Exact Value            Approx. 
m (t_slot)    M=10      M=100      Value           M=10       M=100       Value 
2             7.34877   7.37080    7.246667        23.32946   23.36726    23.24667 
10            27.7045   27.7934    27.31684        107.6080   107.7734    107.3168 
100           253.757   254.098    252.3287        1053.358   1054.013    1052.328 



The numerical results reported in Tables 4 and 5 confirm that expressions (10.a) and 
(10.b) provide precise approximations of $E[Energy_{tagged\_collision}]$ and 
$E[Energy_{not\_tagged\_collision}]$. Furthermore, the numerical results show that both 
$E[Energy_{tagged\_collision}]$ and $E[Energy_{not\_tagged\_collision}]$ do not depend 
significantly on the M value, but only on the average packet length and on the PTX and 
PRX values. This characteristic will be used in the definition of our transmission 
control strategy in Section 4. 



Table 5. $E[Energy_{not\_tagged\_collision}]$ 

              Exact Value, PTX/PRX = 2    Exact Value, PTX/PRX = 10    Approx. 
m (t_slot)    M=10      M=100             M=10      M=100              Value 
2             5.30675   5.32942           5.29517   5.32702            5.246667 
10            17.5174   17.6106           17.4525   17.5966            17.31684 
100           152.015   152.556           151.367   152.456            152.3287 



Let us now focus on the average number of stations that try to access the channel at 
the same time when the minimum energy consumption is achieved. It is straightforward to 
derive that this average number is given by the $M \cdot p_{min}$ product. 

Figures 2.a and 2.b show the $M \cdot p_{min}$ product for an average message length m 
equal to 2 and 100 time slots, respectively. In each figure we plot the 
$M \cdot p_{min}$ value for different PTX/PRX values. The curves labeled with 
PTX/PRX = 1 correspond to the average number of transmitting stations that maximizes 
the throughput. Specifically, with PTX/PRX = 1, Equation (1) reduces to the average 
length of a virtual transmission time. Minimizing Equation (1) then corresponds to 
minimizing the time necessary to complete a successful transmission for the tagged 
station, hence to maximizing the throughput. 
[Figure 2: (a) m = 2 t_slot; (b) m = 100 t_slot.]

Fig. 2. The M·p_opt product.

From the plotted curves we derive that the M·p_opt product exhibits a dependency on the
M value for small network populations. This effect is more marked when
PTX/PRX = 10. Specifically, for PTX/PRX = 2, the M·p_opt product can be considered
a constant function independent of the M value. In this case, the energy-consumption
minimization problem is very similar to the throughput maximization problem, because
the corresponding M·p values do not differ significantly. In other words, adopting the
p_opt value as transmission probability, beyond providing power saving, also yields
quasi-optimal channel utilization. Instead, for PTX/PRX = 10, the M·p_opt product shows
a significant dependency on the M value. This dependency reduces only for large M values
(e.g., 50). This behavior is explained by observing that the energy consumed during not
tagged collisions depends only on the PRX parameter, as does E[Energy_Idle_p], whereas
the PTX parameter has a significant impact on the energy consumed during tagged
collisions. However, in a large network the probability of a not tagged collision is
much higher than the probability of a tagged collision; hence in this case the impact
of the PTX parameter is significantly reduced.

From Figures 2.a and 2.b it follows that the transmission probability that minimizes
the energy consumption is always lower than the transmission probability that
maximizes the throughput. To better explain this behavior, we refer to Figure 3,
where we have plotted E[Energy_Idle_p] and E[Energy_Coll] versus the p value, for
various PTX/PRX values. E[Energy_Idle_p] is equal to E[Idle_p] due to the assumption
of PRX = 1. In the same way, the E[Energy_Coll] related to PTX/PRX = 1 is equal to the
average length of a collision given a transmission attempt, also referred to as E[Coll].
The p value that corresponds to the intersection point of the E[Energy_Idle_p] and
E[Energy_Coll] curves is the approximation of the p_opt value. It is worth remembering
that for PTX/PRX = 1 the p value that corresponds to the intersection point of
E[Idle_p] with E[Coll] provides a good approximation of the p value that maximizes the
throughput (see [4]). We note that by increasing the PTX value E[Energy_Coll] also grows,
due to the rise in the energy consumed during tagged collisions. However,
E[Energy_Idle_p] does not depend on the PTX value; hence, only a decrease in the p value
can balance the increase in E[Energy_Coll].




Fig. 3. p_opt approximation with M = 10 (x-axis: p value).



4. The Power Saving - Simple Dynamic IEEE 802.11 Protocol
(PS-SDP)

In this section, we define the Power Saving - Simple Dynamic IEEE 802.11 protocol
(PS-SDP), that is, the standard p-persistent IEEE 802.11 protocol enhanced by a
transmission control strategy based on the run-time estimate of the p_opt value. Recall
that in [3] we defined a Simple Dynamic IEEE 802.11 protocol (SDP) to achieve
throughput maximization. PS-SDP extends SDP with new power-saving features. To define a
transmission control strategy that is not cumbersome, we have to rewrite Equation (9)
in a simpler, even though approximated, way. To this end we introduce the following
polynomial approximations:

$$ F_M(p) = (1-p)^M \approx 1 - Mp + O(p^2) \qquad (11.a) $$

$$ G_M(p) = Mp(1-p)^{M-1} \approx Mp - M(M-1)p^2 + o(p^2) \qquad (11.b) $$
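
To get a feel for how tight these truncations are, the following sketch compares the exact F_M(p) and G_M(p) with their polynomial approximations. The function names and the sample values of M and p are illustrative choices, not taken from the paper.

```python
def f_exact(M: int, p: float) -> float:
    """Probability that none of M stations transmits in a slot."""
    return (1 - p) ** M


def f_approx(M: int, p: float) -> float:
    """First-order truncation of f_exact, as in (11.a)."""
    return 1 - M * p


def g_exact(M: int, p: float) -> float:
    """Probability that exactly one of M stations transmits in a slot."""
    return M * p * (1 - p) ** (M - 1)


def g_approx(M: int, p: float) -> float:
    """Second-order truncation of g_exact, as in (11.b)."""
    return M * p - M * (M - 1) * p ** 2


if __name__ == "__main__":
    for M, p in ((10, 0.001), (10, 0.01), (100, 0.001), (100, 0.01)):
        print(M, p,
              abs(f_exact(M, p) - f_approx(M, p)),
              abs(g_exact(M, p) - g_approx(M, p)))
```

The errors grow with the M·p product, which is why the approximations are most trustworthy for small transmission probabilities.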

By exploiting relationships (11.a) and (11.b), it is straightforward to derive:

$$ \Pr\{tagged\_collision \mid tag\_tr\} = 1 - F_{M-1}(p) \approx (M-1)p \qquad (12.a) $$
$$ \Pr\{not\_tagged\_collision \mid not\_tag\_tr\} \approx (M-2)p \qquad (12.b) $$



The approximations provided by relationships (12.a) and (12.b) are not intended to be
the optimal approximations for Pr{tagged_collision | N_tr ≥ 1} and
Pr{not_tagged_collision | N_tr ≥ 1}, but they clearly show the dominant role of the
M·p product in determining these probabilities. By exploiting relationships (11.a) and
(11.b), and by assuming that Pr{tag_tr | N_tr ≥ 1} ≈ 1/M and
Pr{not_tag_tr | N_tr ≥ 1} ≈ (M - 1)/M, it follows:



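$$ E[Energy_{Idle\_p}] \approx PRX \cdot \frac{1 - Mp}{Mp} \qquad (13.a) $$

$$ E[Energy_{tagged\_collision|transm}] \approx E_{t} \cdot (M-1)p \cdot \frac{1}{M} \qquad (13.b) $$

$$ E[Energy_{not\_tagged\_collision|transm}] \approx E_{nt} \cdot (M-2)p \cdot \frac{M-1}{M} \qquad (13.c) $$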



We cannot directly exploit relationships (13.a)-(13.c) to find the p_opt value, due to
the presence of the unknown M parameter. However, by exploiting relationships
(13.a)-(13.c) we can rewrite Equation (9) in such a way that the p_opt estimation does
not require any information about the M value, or its distribution.
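
To make the role of the M·p product concrete, the sketch below solves the balance between the idle energy (13.a) and the sum of the collision energies (13.b) and (13.c) for p when M is known. The per-collision energies e_t and e_nt are passed in as plain numbers, and the values used in the usage lines are illustrative inputs, not results from the paper.

```python
import math


def idle_energy(p: float, M: int, prx: float = 1.0) -> float:
    """Approximate idle energy per transmission attempt, as in (13.a)."""
    return prx * (1 - M * p) / (M * p)


def collision_energy(p: float, M: int, e_t: float, e_nt: float) -> float:
    """Approximate collision energy per attempt: (13.b) plus (13.c)."""
    return e_t * (M - 1) * p / M + e_nt * (M - 2) * p * (M - 1) / M


def balancing_p(M: int, e_t: float, e_nt: float, prx: float = 1.0) -> float:
    """Positive root of k*M*p^2 + prx*M*p - prx = 0, where the two energies match."""
    k = e_t * (M - 1) / M + e_nt * (M - 2) * (M - 1) / M
    a, b, c = k * M, prx * M, -prx
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)


if __name__ == "__main__":
    for M in (5, 10, 50, 100):
        p = balancing_p(M, e_t=7.25, e_nt=5.25)
        print(M, p, M * p)
```

Printing M·p for several network sizes is a simple way to see how strongly the balancing point depends on M for a given pair of collision energies.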

Let us consider the n-th transmission attempt, and let p_n be the transmission
probability adopted by all the stations during the n-th transmission attempt. From
relationships (13.a)-(13.c), it follows that at the end of the n-th transmission attempt:

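$$ E[Energy_{Idle\_p}] \approx E[Energy_{Idle\_p}]_n = PRX \cdot \frac{1 - M p_n}{M p_n} \qquad (14.a) $$

$$ E[Energy_{Coll}] \approx E[Energy_{Coll}]_n = E_{t} \cdot (M-1) p_n \cdot \frac{1}{M} + E_{nt} \cdot (M-2) p_n \cdot \frac{M-1}{M} \qquad (14.b) $$

If p_n ≠ p_opt, then E[Energy_Idle_p]_n ≠ E[Energy_Coll]_n and Equation (9) does not hold.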

For the (n+1)-th transmission attempt our algorithm searches for a new transmission
probability p_{n+1} that balances, in the future, the energy consumed during idle
periods and collisions, namely E[Energy_Idle_p]_{n+1} = E[Energy_Coll]_{n+1}. To this
end, we first express p_{n+1} as a function of an unknown quantity x, such that
p_{n+1} = p_n (1 + x). Then, by assuming E[Energy_Idle_p]_{n+1} = E[Energy_Coll]_{n+1},
from (14.a) and (14.b), after some algebraic manipulations, we obtain:

^ The lower the p value, the more accurate these assumptions are.
$$ (1 + x) = \frac{\sqrt{PRX^2 + 4\,PRX\,(1 + E[Idle\_p]) \left( \frac{E[Energy_{tagged\_collision|transm}]}{M} + E[Energy_{not\_tagged\_collision|transm}] \right)} - PRX}{2 \left( \frac{E[Energy_{tagged\_collision|transm}]}{M} + E[Energy_{not\_tagged\_collision|transm}] \right)} \qquad (15) $$

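
One way to see where an update of this kind comes from is sketched below. The derivation is ours and relies only on (14.a), (14.b), the substitution p_{n+1} = p_n(1+x) and the M ≫ 1 simplification; C_n is a shorthand introduced here for the collision-energy estimate at step n, and the arrangement of terms may differ from the form printed as (15).

```latex
% From (14.a), with E[Energy_{Idle\_p}]_n = PRX \cdot E[Idle\_p]_n:
\begin{align*}
  \frac{1 - M p_n}{M p_n} = E[Idle\_p]_n
  \quad &\Longrightarrow \quad M p_n = \frac{1}{1 + E[Idle\_p]_n} . \\
% Idle energy after scaling p_n by (1+x):
  E[Energy_{Idle\_p}]_{n+1}
  &= PRX \cdot \frac{1 - M p_n (1+x)}{M p_n (1+x)}
   = PRX \cdot \frac{(1 + E[Idle\_p]_n) - (1+x)}{1+x} . \\
% Collision energy is linear in p, so it scales by (1+x):
  E[Energy_{Coll}]_{n+1} &= (1+x)\, C_n , \qquad
  C_n \approx \frac{E[Energy_{tagged\_collision|transm}]_n}{M}
            + E[Energy_{not\_tagged\_collision|transm}]_n . \\
% Balancing the two gives a quadratic in (1+x):
  C_n (1+x)^2 + PRX\,(1+x) &- PRX\,(1 + E[Idle\_p]_n) = 0 , \\
  (1+x) &= \frac{\sqrt{PRX^2 + 4\,PRX\,(1 + E[Idle\_p]_n)\, C_n} - PRX}{2\, C_n} .
\end{align*}
```

As a sanity check, when the two energies are already balanced, C_n = PRX · E[Idle_p]_n, the square root equals PRX (1 + 2 E[Idle_p]_n) and the expression returns 1 + x = 1, so p is left unchanged.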



The relationship (15) is valid for M ≫ 1. Hence, to completely eliminate the algorithm
dependency on the M value, we make the conservative assumption that M = 10. It means
that ten percent of E[Energy_tagged_collision|transm] always contributes to the (1 + x)
evaluation. This is a conservative assumption because the percentage of
E[Energy_tagged_collision|transm] that impacts the (1 + x) evaluation decreases below
ten percent for M values greater than ten. However, we believe that this conservative
behavior still guarantees a significant improvement of protocol efficiency, and our
assessment will be extensively validated through the performance analysis executed
in the following section.



5. Performance Analysis in Steady-State Conditions 

In this section we investigate, via simulation, the performance achieved by PS-SDP
when the network operates in steady-state conditions. In our simulations we assumed an
ideal channel with no transmission errors and no hidden terminals, i.e., all the
stations can always hear all the others. The effectiveness of PS-SDP has been analyzed
with different network configurations. Specifically, we ran simulation experiments for
several network populations (i.e., M ∈ [5, 100]) and message lengths (m = 2 t_slot and
m = 100 t_slot).

As we have previously explained, PS-SDP operates at the completion of each
transmission attempt with the target of adjusting the p value so that Equation (9) holds.
The updating rule provided by relationship (15) requires the knowledge of the average
length of idle periods, i.e., E[Idle_p], the average energy consumed during
tagged-station collisions given a transmission attempt, i.e.,
E[Energy_tagged_collision|transm], and the average energy consumed during not
tagged-station collisions given a transmission attempt, i.e.,
E[Energy_not_tagged_collision|transm]. Each station, by using the carrier sensing
mechanism, can observe the channel status and measure the length of idle periods and
busy periods. In the latter case, we assume that it can distinguish successful
transmissions from collisions by observing the ACK. Among collisions, each station
obviously knows which ones it was involved in. From these values, E[Idle_p],
E[Energy_tagged_collision|transm] and E[Energy_not_tagged_collision|transm] can be
approximated by exploiting a moving average window:

$$ E[Idle\_p]_n = \alpha \cdot E[Idle\_p]_{n-1} + (1 - \alpha) \cdot Idle\_p_n \qquad (16.a) $$

$$ E[Energy_{tagged\_collision|transm}]_n = \alpha \cdot E[Energy_{tagged\_collision|transm}]_{n-1} + (1 - \alpha) \left[ PTX \cdot Coll\_tagged_n + PRX \cdot \max(0,\, Coll_n - Coll\_tagged_n) \right] \qquad (16.b) $$

$$ E[Energy_{not\_tagged\_collision|transm}]_n = \alpha \cdot E[Energy_{not\_tagged\_collision|transm}]_{n-1} + (1 - \alpha) \cdot PRX \cdot Coll\_not\_tagged_n \qquad (16.c) $$



where:

• E[Idle_p]_n, E[Energy_tagged_collision|transm]_n and
E[Energy_not_tagged_collision|transm]_n are the approximations at the end of the
n-th transmission attempt;

• Idle_p_n is the length of the n-th idle period;

• Coll_n is zero if the n-th transmission attempt is either successful or a not
tagged-station collision, otherwise it is the collision length;

• Coll_tagged_n is zero if the n-th transmission attempt is either successful or a not
tagged-station collision, otherwise it is the length of the tagged-station transmission;

• Coll_not_tagged_n is zero if the n-th transmission attempt is either successful or a
tagged-station collision, otherwise it is the collision length;

• α ∈ [0, 1] is a smoothing factor.
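
The three moving averages can be sketched as a single update function. This is a minimal illustration of (16.a)-(16.c) as described above; the `Estimates` container, the function name and the sample numbers are assumptions made for the example, not part of the protocol definition.

```python
from dataclasses import dataclass


@dataclass
class Estimates:
    """Running estimates kept by one station (names are illustrative)."""
    idle: float = 0.0        # E[Idle_p]
    tagged: float = 0.0      # E[Energy_tagged_collision|transm]
    not_tagged: float = 0.0  # E[Energy_not_tagged_collision|transm]


def update_estimates(est: Estimates, idle_len: float, coll: float,
                     coll_tagged: float, coll_not_tagged: float,
                     alpha: float, ptx: float, prx: float = 1.0) -> Estimates:
    """Apply one step of the exponentially weighted averages (16.a)-(16.c)."""
    # Tagged collision: transmit for the own message, then listen to the rest.
    tagged_sample = ptx * coll_tagged + prx * max(0.0, coll - coll_tagged)
    # Not tagged collision: the station only listens.
    not_tagged_sample = prx * coll_not_tagged
    return Estimates(
        idle=alpha * est.idle + (1 - alpha) * idle_len,
        tagged=alpha * est.tagged + (1 - alpha) * tagged_sample,
        not_tagged=alpha * est.not_tagged + (1 - alpha) * not_tagged_sample,
    )


if __name__ == "__main__":
    est = Estimates(idle=4.0, tagged=1.0, not_tagged=2.0)
    # A tagged collision of 5 slots in which the station itself sent for 3 slots.
    print(update_estimates(est, idle_len=2.0, coll=5.0, coll_tagged=3.0,
                           coll_not_tagged=0.0, alpha=0.9, ptx=2.0))
```

A value of α close to 1 gives a long memory and smooth estimates, while a smaller α reacts faster to changes at the cost of more fluctuation.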

The use of a smoothing factor, α, is widespread in network protocols to obtain
reliable estimates from network measurements while avoiding harmful fluctuations. In the
following we summarize the steps performed independently by each station to compute
the p_{n+1} value for the current network and load conditions, given the p_n value.



begin

step 1: measure the n-th idle period Idle_p_n;

step 2: measure Coll_n, Coll_tagged_n and Coll_not_tagged_n;

step 3: update E[Idle_p]_n, E[Energy_tagged_collision|transm]_n and
        E[Energy_not_tagged_collision|transm]_n [by exploiting (16.a)-(16.c)];

step 4: calculate
        p_new = p_n · { sqrt( PRX^2 + 4·PRX·(1 + E[Idle_p]_n)·C_n ) - PRX } / (2·C_n),
        with C_n = E[Energy_tagged_collision|transm]_n / 10
                   + E[Energy_not_tagged_collision|transm]_n;

step 5: p_{n+1} = α · p_n + (1 - α) · p_new

end
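
The probability update at the end of each transmission attempt can be sketched as follows. The closed form used for the scaling factor is the positive root of the energy-balance quadratic implied by (14.a) and (14.b); treat it as our reading of the update rule rather than a verbatim transcription. The function name, the clamp to 1 and the handling of a zero collision estimate are assumptions added for the example.

```python
import math


def next_p(p_n: float, idle_est: float, tagged_est: float,
           not_tagged_est: float, alpha: float,
           prx: float = 1.0, assumed_m: int = 10) -> float:
    """One update of the transmission probability from the current estimates.

    idle_est is E[Idle_p]_n; tagged_est and not_tagged_est are the two
    collision-energy estimates; assumed_m is the fixed M = 10 of the text.
    """
    c = tagged_est / assumed_m + not_tagged_est
    if c <= 0.0:
        # No collision energy observed yet: limit of the root as c -> 0.
        scale = 1.0 + idle_est
    else:
        scale = (math.sqrt(prx ** 2 + 4 * prx * (1 + idle_est) * c) - prx) / (2 * c)
    p_new = p_n * scale
    p_next = alpha * p_n + (1 - alpha) * p_new  # smoothing of the probability
    return min(1.0, p_next)                     # safeguard: keep p a probability


if __name__ == "__main__":
    # Balanced case: idle energy (3.0) equals collision energy (10/10 + 2).
    print(next_p(0.02, idle_est=3.0, tagged_est=10.0, not_tagged_est=2.0, alpha=0.5))
    # Collisions dominate: p is reduced.
    print(next_p(0.02, idle_est=1.0, tagged_est=10.0, not_tagged_est=4.0, alpha=0.0))
```

When the idle and collision energies are already balanced the function leaves p unchanged, and it lowers p when collisions dominate, which is the qualitative behavior the strategy aims for.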
In Figures 4.a and 4.b we compare (for an average message length equal to 2 t_slot) the
energy consumption measured in the standard IEEE 802.11 protocol (STD 802.11)
and in PS-SDP against the theoretical lower bound (OPT 802.11). In Figures 4.a and
4.b we vary the network configuration (i.e., M ∈ [5, 100]) and the network interface
characteristics (i.e., PTX/PRX = 2 and PTX/PRX = 10). The logarithmic scale for
the y-axis has been chosen to better highlight the energy consumption behavior both
for the low values measured within small networks and for the high values measured
within large networks.





[Figure 4: energy consumption versus M. (a) m = 2 t_slot and PTX/PRX = 2; (b) m = 2 t_slot and PTX/PRX = 10.]

Fig. 4. Energy Consumption for m = 2 t_slot and different PTX/PRX values.



The numerical results clearly show that the effectiveness of PS-SDP does not depend
on the number of stations in the network. These results validate the effectiveness of
the conservative assumption of assigning to E[Energy_tagged_collision|transm] a fixed
weight of 0.1, as we have done in step 4 of our transmission control strategy.
Furthermore, by comparing the curves obtained with PTX/PRX = 2 and PTX/PRX = 10 it is
confirmed that the PS-SDP effectiveness is independent of the network interface
behavior. For PS-SDP we have considered different smoothing factors to investigate the
impact of the memory-estimate length on the protocol performance. The numerical
results show that as the memory-estimate length increases (i.e., for increasing α
values), PS-SDP gets closer to the theoretical lower bound for the energy consumption.
However, the differences between the measured energy consumptions are not meaningful
and cannot be appreciated from the figures.

In Figures 5.a and 5.b we compare (for an average message length equal to 100 t_slot)
the energy consumption measured in the standard IEEE 802.11 protocol (STD
802.11) and in PS-SDP against the theoretical lower bound (OPT 802.11). In Figures
5.a and 5.b we vary the network configuration (i.e., M ∈ [5, 100]) and the network
interface characteristics (i.e., PTX/PRX = 2 and PTX/PRX = 10). The discussion
performed in the case of short messages (m = 2 t_slot) can be repeated in the case of
long messages (m = 100 t_slot), shown in Figures 5.a and 5.b. Hence, we have confirmed
that PS-SDP is adaptive to both the number of stations in the network and the
traffic characteristics.




Fig. 5. Energy Consumption for m = 100 t_slot and different PTX/PRX values.



6. Conclusions 

In this paper, by considering the p-persistent IEEE 802.11 protocol, we have
investigated the station behavior that minimizes the energy consumption required to
successfully transmit a message. By exploiting this analysis we have found that the
minimum energy consumption is closely approximated when each station adopts a
transmission probability that makes the average energy spent sensing the channel equal
to the average energy wasted during collisions. This property has been used to propose
a Power Saving - Simple Dynamic IEEE 802.11 Protocol (PS-SDP) that allows each station
to estimate the p_opt value at run-time. PS-SDP is based on simple and low-cost
energy-consumption estimates. Our strategy does not require any information about the
network population, yet it is adaptive to the number of stations in the network.
Furthermore, it is completely distributed, so it fits well the ad hoc networking
paradigm. We have shown through simulation results that PS-SDP effectively approaches
the minimum energy consumption in all the configurations analyzed. Further research
involves the investigation of PS-SDP behavior in dynamic conditions related to either
bursty arrivals of transmission requirements or network topology changes.
Appendix A.



From the analytical results presented in [4], it follows: 



$$ E[Energy_{Idle\_p}] = PRX \cdot E[Idle\_p] = PRX \cdot \frac{(1-p)^M}{1 - (1-p)^M} $$


E\Energy^^ 



^ ] — PRX ■ Collision 



Pv {jagged _ collision I tag _tr{ — 



Ma^c] 






(A.1) 

(A.2) 

(A.3) 



Pr {not _ tagged _ collision I not _ tag _ tr{ 



£[Ac]+ 1 
(A.5)



(A.6) 
Abstract. The fast adoption of IP-based communications for hand-held devices
equipped with wireless interfaces is creating new challenges for the Internet 
evolution. Users expect flexible access to Internet based services, including not 
only traditional data services but also multimedia applications. This generates a 
new challenge for QoS provision, as it will have to deal with fast mobility of 
terminals being independent of the technology of the access network. Various 
QoS architectures have been defined, but none provides full support for guaran- 
teed service levels for mobile hosts. This paper discusses the problems related 
to providing QoS to mobile hosts and identifies the existing solutions and fu- 
ture work needed. 



1 Introduction 

The emerging wireless access networks and third generation cellular systems constitute
the enabling technology for "always-on" personal devices. IP protocols, traditionally
developed by the Internet Engineering Task Force (IETF), have mainly been designed for
fixed networks. Their behaviour and performance are often affected when deployed over
wireless networks.

The telecom world has created various systems for enabling wireless access to the
Internet. Systems such as the General Packet Radio Service (GPRS), Enhanced Data 
Rate for GSM Evolution (EDGE), Universal Mobile Telecommunications System 
(UMTS) and International Mobile Telecommunications (IMT-2000) are able to carry 
IP packets using a packet switching network parallel to the voice network. These 
architectures use proprietary protocols for traffic management, routing, authorisation 
or accounting, to enumerate some, and are governed by licenses and expensive sys- 
tem costs. 

From the QoS point of view, the problems with mobility in a wireless access net- 
work and mobility-related routing schemes are related to providing the requested 
service even if the mobile node changes its point of attachment to the network. Hand- 
overs between access points, change of IP-addresses, and mechanisms for the intra- 
domain micro mobility mechanisms may create situations where the service assured 
to the mobile node cannot be provided, and a violation of the assured QoS may occur. 
A QoS violation may result from excess delays during handovers, packet losses, or 
even total denial of service. In the case where the user only requested differentiation 
according to a relative priority to flows, a short QoS violation may fit within accept- 
able limits. If the flows were allocated explicit resources, the new network access 
point and route from the domain edge should provide the same resources. 

Several research projects within the academic community, e.g. INSIGNIA 
[Lee00], and in the industrial community, e.g. ITSUMO [Chen00], have sought to
combine mobility with guaranteed QoS. In the BRAIN project [BRAI00], we are
envisioning an all IP network, where seamless access to Internet based services is 
provided to users. By using IETF protocols, we are designing a system that would be 
able to deliver high-bandwidth real-time multimedia independent of the wireless 
access network or the wireless technology used to connect the user to Internet. This 
implies the need for IP mobility support and also end-to-end QoS enabled transport. 
The provision of QoS guarantees over heterogeneous wireless networks is a challeng- 
ing issue; especially because over-provisioning is not always possible and the per- 
formance of the wireless link is highly variable. We focus our architecture on wireless 
LAN networks, since these provide high bandwidths but may also create frequent 
handoffs due to fast moving users - this type of architecture is most demanding in 
view of mobility management and QoS. 



2 QoS and Mobility Background 

This section presents QoS and mobility architectures relevant to the further discussion.
We have not covered all existing architectures in our study, but at least those
considered most important or promising in order to understand all the issues
concerning QoS and mobility interactions.
In the following discussion, the term mobile node (MN) is used to refer to a mobile
host or mobile router. If mobile host (MH) is used, the term mobile router does not
apply, and vice versa.

Regarding QoS we have considered IETF-presented architectures for providing
different levels of services to IP flows, although much work has been done within the
academic community and the telecom industry; for example INSIGNIA and ITSUMO
are mature proposals for providing QoS to data flows. INSIGNIA has its own in-band
signalling mechanism and ITSUMO is based on the DiffServ framework.

The IETF architectures can be classified into three types according to their
fundamental operation; the Integrated Services framework [Wroc97] and the Resource
Reservation Protocol (RSVP, [BZB+97]) provide explicit end-to-end reservations;
the Differentiated Services architecture (DiffServ, [BBC+98], [BBGS01]) offers
hop-by-hop differentiated treatment of packets. There are a number of ‘work in progress’
efforts, which are directed towards aggregated control models. These include
aggregation of RSVP [BILD00], the RSVP DCLASS Object [Be00] to allow DSCPs
to be carried in RSVP message objects, and the operation of Integrated Services over
Differentiated Services networks ([Bern00], [WC00]) proposed by the Integrated
Services over Specific Link Layer (ISSLL) Working group. On the application level
the Real-Time Transport Protocol (RTP, [SCFJ96]) provides mechanisms for flow
adaptation and control above the transport layer.

For Mobility Management we have based our study on an analytical method we call
the Evaluation Framework [EMS00], which has been adopted for facilitating detailed
analysis and comparative evaluation of mobility protocols. This framework facilitates
the selection of the most promising candidates for mobility management and introduces
a categorisation for distinguishing protocols and their associated purposes. This
analysis is closely related to QoS development, since both mobility and QoS protocols
are expected to have awareness of certain, if not all, of their functionality.

For the interaction study we have considered several mobility architectures present
today. On the macro-mobility side, Mobile IP [Perk00] is the current standard for
supporting macroscopic mobility in IP networks. Its IPv6 counterpart, Mobile IP
support in IPv6 [JP00], is based on the experiences gained from the development of
Mobile IP support in IPv4 and on the opportunities provided by the new features of the
IP version 6 protocol.

For the support of regional mobility we identified two major categories: Proxy - 
Agent Architectures (PAA) which extend the idea of Mobile IP into a hierarchy of 
Mobility Agents and Localized Enhanced-Routing Schemes (LERS) which introduce a 
new, dynamic Layer 3 routing protocol in a ‘localised’ area. 

In the first group (PAA) examples include the initial Hierarchical Mobile IP
[Perk97] and its alternatives, which place and interconnect Mobility Agents more
efficiently: Mobile IP Regional Registration [GJP01], Transparent Hierarchical
Mobility Agents (THEMA) [MHW+99] and Fast Handoff in Mobile IPv4 [El01]. The
new Mobile IP version 6 [JP00] has had some optional extensions that apply a
hierarchical model where a border router acts as a proxy Home Agent for the Mobile
Nodes. They include "Hierarchical MIPv6 mobility management" [SCEB01] and
"Mobile IPv6 Regional Registrations" [MP01].
In the second group (LERS) there are several distinctive approaches: per-host
forwarding schemes, where soft-state host-specific forwarding entries are installed for
each MN (HAWAII [RLT+99], Cellular IP [CGK+00], Cellular IPv6 [SGCW00]);
multicast-based schemes, which make use of multicast protocols for supporting
point-to-multipoint connections (dense-mode multicast-based [SBK95][MB97][TPL99] and
the recent sparse-mode multicast-based [MSA00]); and MANET-based schemes
adapted for mobile ad-hoc networks (MER-TORA [OTC00] [OT01]).

Figure 1 shows some of the many IP mobility protocols, which category they fall 
into and very roughly how they relate to each other. 
Fig. 1. Classification of mobility protocols

We will pay special attention to Handover Management, as it is considered one of 
the most important features of the mobility protocols when considering the interaction 
with QoS protocol because of the likely re-negotiation of QoS parameters. Handover 
refers in general to support for terminal mobility wherever the mobile node changes 
its point of attachment to the network. 

We can identify several handover types: a Layer-2 handover happens if the network
layer is not involved in the handover; an intra-access network handover happens when
the new point of attachment is in the same access network; an inter-access network
handover happens when the new access router is in a different access network. A
horizontal or a vertical handover is said to happen if the old and the new access
routers use the same or different wireless interfaces (technologies), respectively.

^ Access Network (AN): An IP network, which includes one or more ARs and gateways.
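
The handover taxonomy can be illustrated with a small classifier. This is only a sketch: the `Attachment` record and its fields are hypothetical, and treating "same access router" as the Layer-2 case is our simplifying assumption about when the network layer is not involved.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Attachment:
    """Hypothetical description of a mobile node's point of attachment."""
    access_router: str
    access_network: str
    radio_technology: str


def classify_handover(old: Attachment, new: Attachment) -> Tuple[str, str]:
    """Return (scope, technology relation) for a move from old to new."""
    if old.access_router == new.access_router:
        scope = "layer-2"                 # assumption: no IP-level change
    elif old.access_network == new.access_network:
        scope = "intra-access-network"
    else:
        scope = "inter-access-network"
    relation = ("horizontal"
                if old.radio_technology == new.radio_technology
                else "vertical")
    return scope, relation


if __name__ == "__main__":
    a = Attachment("ar1", "an1", "wlan")
    b = Attachment("ar2", "an1", "wlan")
    c = Attachment("ar9", "an2", "umts")
    print(classify_handover(a, b))
    print(classify_handover(a, c))
```

The two dimensions are independent: the scope depends on where the new access router sits, while the horizontal or vertical label depends only on the radio technology.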

We can also distinguish three different phases in a handover: the Initiation Phase,
when the need for a handover (and its initiation) is recognized; the Decision Phase,
when the best target access router is identified and the corresponding handover is
triggered, based on measurements on neighbouring radio transmitters and eventual
network policy information; and the Execution Phase, when the mobile node is
detached from the old access router and attached to the new one.

In a planned handover, contrary to an unplanned handover, some signalling mes- 
sages can be sent before the mobile node is connected to the new access router, e.g. 
building a temporary tunnel from the old access router to the new access router. 

Specific actions may be performed depending on the handover phase. For example,
the events may initiate upstream buffering or advance registration procedures at
the mobile node. These mechanisms further characterize the handover type: a
smooth handover is a handover with minimum packet loss, a fast handover allows
minimum packet delays, and a seamless handover is both smooth and fast.



3 Interaction of Mobility and QoS 

This section discusses the problems related to guaranteeing service levels to mobile 
nodes. We classify the problem areas into three groups, namely topology related 
problems (3.1), and macro (3.2) and micro mobility (3.3) related issues. Solutions to 
these problems are presented in Section 4. 



3.1 Depth of Handovers 

We can identify several types of handover situations, which create different amounts 
of control signalling between different entities; handovers within the same Access 
Router (AR), between ARs and between access networks. The same physical hand- 
over can create different logical handover situations to different MN flows if the 
flows use different network gateways. Figure 2 shows a sample network topology to 
illustrate the levels of handovers while a MN moves within and between two
networks.

The different levels of handovers create variable load of signalling in the access 
network. Also, if the QoS architecture has a signalling mechanism, such as RSVP, it 
adds to the need to signal in certain handover situations. 

If the AR node does not change during a handover, the handover control only 
needs to handle radio resources since the routing paths do not change. 

If the AR changes but the gateway stays the same due to similar routing, the handover
affects the radio resource availability and the access network resources. In addition,
the new AR may need to check for admission control at the same time. All RSVP
reservations need to be refreshed.

^ Access Router (AR): An IP router between an Access Network and one or more access links.

If the gateway changes, either within the same access network or when the MN 
changes networks, flows may experience a drop in their QoS until the QoS signalling 
has updated the nodes on the paths. The time interval during which the MN is not 
receiving the subscribed QoS needs to be minimized. 
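
The three handover depths described above can be summarised as a mapping from what changed to the control actions that follow. The sketch below is an illustration only; the function name and the action strings are ours, chosen to paraphrase the text.

```python
from typing import List


def handover_actions(ar_changed: bool, gateway_changed: bool) -> List[str]:
    """List the control actions a handover of the given depth may trigger."""
    # Every handover has to deal with radio resources.
    actions = ["handle radio resources"]
    if ar_changed:
        # A new access router means new paths inside the access network.
        actions += [
            "check access network resources",
            "run admission control at the new AR",
            "refresh RSVP reservations",
        ]
    if gateway_changed:
        # Flows may see degraded QoS until signalling has caught up.
        actions.append("update QoS state on the nodes of the new path")
    return actions


if __name__ == "__main__":
    print(handover_actions(ar_changed=False, gateway_changed=False))
    print(handover_actions(ar_changed=True, gateway_changed=False))
    print(handover_actions(ar_changed=True, gateway_changed=True))
```

The deeper the handover, the longer the list, which mirrors the growing signalling load and the longer interval during which the subscribed QoS may not be delivered.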




Fig. 2. Example network topology regarding different handover scenarios 



3.2 Macro Mobility Issues 

The first macro-mobility problem arises from the triangular routing phenomenon.
Packets from the MN usually follow a direct path to the correspondent nodes (CNs),
whereas packets from the CNs are re-routed via the MN's home network to its point of
attachment in a foreign network, from where they are forwarded to the MN's current
location. Several QoS architectures operate best when packets follow the same route in
the forward and reverse direction. Triangular routing can affect the service level
guarantees of these schemes.

It is possible to tunnel the upstream flow to follow the downstream using Reverse 
Tunnelling [MontOl]. However, routers in the tunnel may not be able to recognize 
some encapsulated parameters of the QoS protocols apart from IP addresses. For ex- 
ample, if RSVP packets use the Router Alert option to indicate to routers on the path 
that they require special handling, when RSVP messages are encapsulated with an 
outer IP header, the Router Alert becomes invisible. Although solutions to this have
been proposed, e.g. RSVP extensions to mobile hosts [AA97], they still add complexity
to the operation of QoS protocols in mobile environments.

Another main concern for QoS when the host is moving is the time needed to re-establish
the routes and, hence, the time needed to re-configure the resource management
required to provide QoS in the new location. Even in Route Optimisation, transmission
of binding updates directly to CNs results in a large update latency and disruption
during handover. This effect is greatly increased if the MN and the HA or CN are
separated by many hops in a wide area network. Data in transit may be lost until the
handover completes and a new route to the MN is fully established. Route Optimisation
(as a protocol specification), however, includes Smooth Handoff support using the
Previous Foreign Agent Notification extension, which can be used to avoid the described
disruption.
There are other problems related to signalling load and address management. 
Highly mobile MNs create frequent notifications to the home agent, which can con- 
sume a significant portion of wireless link resources. Since the current Mobile IP 
standard requires the mobile to change the care-of address (either FA or co-located) at 
every subnet transition, it is more complex to reserve network resources on an end-to- 
end path between the CN and the mobile. For example, if RSVP is used, new reserva- 
tions over the entire data path must be set up whenever the care-of address changes. 
The impact on the latency for re-establishment of the new routes is critical for QoS 
assurances. 

Mobile IPv6 

Mobile IPv6 makes use of the new features provided by IPv6 protocol. They help to 
solve most of the problems discussed above which arise with the use of Mobile IP in 
IPv4 networks. For example Route Optimisation is included in the protocol, and there 
are mechanisms for movement detection that allow a better performance during 
handover. The Routing Header avoids the use of encapsulation, reducing overhead 
and facilitating, for example, QoS provision. 

Although the Mobile IPv6 solution meets the goals of operational transparency
and handover support, it is not optimised for managing seamless mobility in large
cellular networks. Large numbers of location update messages are very likely to
occur, and the latency involved in communicating these update messages to remote
nodes makes it unsuitable for supporting real-time applications on the Internet. These
problems indicate the need for a new, more scalable architecture with support for
uninterrupted operation of real-time applications.



3.3 Micro mobility Issues 

The domain internal micro mobility schemes may use different tunnelling
mechanisms, multicast or adaptable routing algorithms. The domain internal movement of
MNs affects different QoS architectures in different ways. IntServ stores state in each
router; thus a moving mobile triggers local repair of routing and resource reservation
within the network. DiffServ, on the other hand, has no signalling mechanism, which
means that no state needs to be updated within the network, but the offered service
level may vary. At least the following design decisions of a micro mobility protocol
need to be considered when combining mobility and QoS architectures within a
network:

• the use of tunnelling hides the original packet information and hinders Multi- 
Field classification, 

• changing the MN care-of-address during the lifetime of a connection, 

• multicasting packets to several access routers consumes resources, 

• having a fixed route to the outer network (always through the same gateway) is 
less scalable, 

• adaptability and techniques (speed and reliability) to changing routing paths, 

• having an optimal routing path from the gateway to the access router and 

• support for QoS routing. 

Multicast approaches can have ill effects on the resource availability, for example, 
because the multicast group can vary very dynamically. The required resources for 
assured packet forwarding might change rapidly inside the domain, triggering differ- 
ent QoS-related control signalling and resource reservations. 

The use of tunnelling can affect the forwarding of QoS-sensitive flows since the 
original IP-packet is encapsulated within another IP-packet. However, as long as the 
tunnel end-points are capable of provisioning resources for the tunnelled traffic flows, 
the agreed QoS level need not be violated. Tunnelling has the advantage that multi- 
ple traffic flows can be aggregated onto a single reservation, and there is inherent 
support for QoS routing. Micro-mobility schemes that rely on explicit per-host for- 
warding information do not have such simple support for QoS routing, because there 
is only one possible route per host. Both IntServ and DiffServ have been extended to
cope with tunnelling ([TKWZ00], [Bla00]) and the changes to the IP-address
([MH00]). Some coupling of the macro and micro mobility protocols and the QoS
architecture may still be needed to ensure an effective total architecture. 



4 Solutions 

This section identifies various schemes for providing parts of an all-inclusive sup- 
port of QoS-aware mobility. A full support of mobile terminals with QoS require- 
ments can be accomplished by a combination of these schemes. 



4.1 Strict Shaping at Network Edges 

Network operators already intercept each packet arriving from an external network 
and decide whether the packet can be allowed into the core network. This admission 
control is performed by a node called the firewall and is based on IP addresses and
port numbers, e.g. identifying applications. Firewalls are typically deployed for
security reasons and usually scan both incoming and outgoing packets.

The firewall operation can be modified by using different rules for performing the 
admission control. Instead of just preventing known security problems, the edge 
nodes would use defined bandwidth and QoS policies on a per-flow basis for control- 
ling the traffic admitted into the network. Both the access routers and the gateways 
perform the admission control, the former for flows originating from mobile nodes 
and the latter for flows emerging from external networks. 

When a previously unknown packet arrives, the edge node will check for the Ser- 
vice Level Agreement (SLA) and policies stored for the particular MN being con- 
tacted. A central bandwidth broker is in charge of the policy management, and once it 
receives a request from an edge node, it checks its databases for the proper forward- 
ing rules and returns them to the edge node. Adjusting the load created by best-effort 
traffic is vital. 

This method can be used to adjust the load admitted into each service class, if the 
network is operating with aggregate service classes, and not per-flow, as with RSVP. 
This can decrease the network load and thus allow for smoother handovers, especially 
if the traffic belonging to the best-effort class is not consuming all leftover capacity. 
Therefore, there is enough bandwidth left to support moving terminals. 

The access routers should not need to make the primary policing decisions when 
the arriving load exceeds the capacity of the forward link. If we allow downlink traf- 
fic to flood the access network, mobility management schemes are affected. A band- 
width broker could be used to co-ordinate the access network resources and configure 
the gateways to drop excess traffic. 
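
As an illustration only, the edge behaviour described in this section can be sketched as follows. The sketch assumes a hypothetical in-memory `BandwidthBroker` holding one policy per MN and one aggregate budget per service class; the class names, the policy fields and the rates are our own assumptions and are not taken from any cited specification.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical per-MN SLA entry: service class and permitted rate."""
    service_class: str
    max_rate_kbps: int


@dataclass
class BandwidthBroker:
    """Central policy store that also tracks the load admitted per class."""
    policies: dict             # MN address -> Policy
    class_capacity_kbps: dict  # service class -> aggregate budget
    admitted_kbps: dict = field(default_factory=dict)

    def request(self, mn_addr: str, rate_kbps: int) -> bool:
        """Return True if the flow towards mn_addr may enter the network."""
        policy = self.policies.get(mn_addr)
        if policy is None or rate_kbps > policy.max_rate_kbps:
            return False  # no SLA, or the flow exceeds the agreed rate
        used = self.admitted_kbps.get(policy.service_class, 0)
        if used + rate_kbps > self.class_capacity_kbps[policy.service_class]:
            return False  # the aggregate class budget would be exceeded
        self.admitted_kbps[policy.service_class] = used + rate_kbps
        return True


class EdgeNode:
    """Access router or gateway that caches the broker's decision per flow."""

    def __init__(self, broker: BandwidthBroker):
        self.broker = broker
        self.flow_table = {}  # (src, dst) -> admitted?

    def handle_packet(self, src: str, dst: str, rate_kbps: int) -> bool:
        flow = (src, dst)
        if flow not in self.flow_table:  # previously unknown packet
            self.flow_table[flow] = self.broker.request(dst, rate_kbps)
        return self.flow_table[flow]
```

Only the first packet of a flow reaches the broker; later packets are handled from the cached decision, which keeps the per-packet work at the edge small. Releasing the budget when a flow ends is deliberately left out of this sketch.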



4.2 Coupling of Micro-mobility and QoS 

In order to improve the behaviour of reservation-based QoS, as defined in the
Integrated Services architecture [BCS94], in the dynamic micro-mobile environment, the
QoS and micro-mobility mechanisms can be coupled to ensure that reservations are
installed as soon as possible after a mobility event such as handover. Reservations are
installed using a QoS signalling protocol, the most widely adopted of which is RSVP,
which will be used in the following discussions as an example of an out-of-band soft
state mechanism. In this study we present three levels of coupling over three
different micro-mobility architectures: proxy agent architectures
[CB00][GJP01][MS00b][MP01], MANET-based schemes [OTC00] and per-host
forwarding schemes [SGCW00][RLT+99][KMTV00]. The three levels of coupling
presented for consideration are described in the following sections.

4.2.1 De-coupled 



In the de-coupled option, the QoS and micro-mobility mechanisms operate
independently of each other and the QoS implementation is not dependent on a particular
mobility mechanism. Changes in network topology are handled by the soft-state nature
of the reservations.

After a mobility event, the QoS for the traffic stream will be disrupted until a
new reservation is installed via refresh messages from the node where the old
route and new route intersect, known as the crossover router (figure 3), to the new
access router (NAR). The reservation between the crossover router and the old access
router (OAR) cannot be explicitly removed, and must be left to time out, which is not
the most efficient use of network resources. This will occur every time the MN changes
AR, which may be many times during one RSVP session, and can lead to poor overall
QoS for an application.

These problems are common to all micro-mobility schemes. 
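
The cost of relying on soft state alone can be made concrete with a small sketch. It assumes a hypothetical `SoftStateTable` per router, a 30 s refresh period and a 90 s state lifetime; these values and names are illustrative assumptions, not figures from the RSVP specification.

```python
class SoftStateTable:
    """Per-router reservation state that vanishes unless refreshed."""

    def __init__(self, lifetime_s: float):
        self.lifetime_s = lifetime_s
        self.expiry = {}  # flow id -> absolute expiry time

    def refresh(self, flow_id: str, now: float) -> None:
        """Install the reservation or extend it (an RSVP-style refresh)."""
        self.expiry[flow_id] = now + self.lifetime_s

    def active(self, now: float) -> set:
        """Drop timed-out entries and return the flows still reserved."""
        self.expiry = {f: t for f, t in self.expiry.items() if t > now}
        return set(self.expiry)


# Hypothetical timeline: refresh period 30 s, state lifetime 90 s.
old_ar = SoftStateTable(lifetime_s=90)
new_ar = SoftStateTable(lifetime_s=90)
old_ar.refresh("flow-1", now=0)        # reservation on the old path
# Handover at t = 10 s; the next periodic refresh follows the new route.
new_ar.refresh("flow-1", now=30)       # no reservation between t = 10 and 30
stale_until = old_ar.expiry["flow-1"]  # old state lingers until t = 90
```

In this timeline the flow is unprotected for 20 s on the new path, while the old path keeps unused resources reserved for a further 80 s after the handover.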




Fig. 3. Concept of a crossover router 



4.2.2 Loosely Coupled 

The loosely coupled approach uses mobility events to trigger the generation of RSVP 
messages, which distribute the QoS information along new paths across the network. 
The RSVP messages can be triggered as soon as the new routing information has 
been installed in the network. This mechanism, the Local Path Repair option, is
outlined in the RSVP specification [BZB+97] and has the effect of minimising the
disruption to the application’s traffic streams because there is a potentially shorter 
delay between handover and reservation set up. It also avoids the problem of trying 
to install a reservation across the network before the routing update information has 
been propagated. The latency for installing the reservation can also be reduced by 
localising the installation to the area of the network affected by the change in topol- 
ogy, i.e. between the crossover router and the NAR. The areas of the network af- 
fected by the topology change can have reservations installed across them almost 
immediately, instead of having to wait for the update to travel end-to-end, or for the 
correspondent node to generate a refresh message for reservations to the MN. In the 
case where the QoS must be re-negotiated, however, end-to-end signalling is re- 
quired. The old reservation should be explicitly removed, freeing up unused re- 
sources immediately. 
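
The localisation of the repair can be sketched as below. The sketch assumes a tree-like access network in which each path is a list of uniquely named routers ordered from the gateway down to the access router; the function names and the topology are hypothetical.

```python
def crossover_router(old_path, new_path):
    """Return the last router shared by both paths, counted from the gateway.

    Both paths are lists of router names ordered from the gateway down to
    the access router.
    """
    crossover = None
    for old_hop, new_hop in zip(old_path, new_path):
        if old_hop != new_hop:
            break
        crossover = old_hop
    return crossover


def local_repair_segments(old_path, new_path):
    """Split the repair work: hops to reserve and hops to tear down."""
    xover = crossover_router(old_path, new_path)
    if xover is None:
        raise ValueError("paths share no common prefix")
    i = old_path.index(xover)
    to_install = new_path[i:]     # crossover router ... NAR
    to_remove = old_path[i + 1:]  # below the crossover ... OAR
    return xover, to_install, to_remove
```

For example, with the old path `["GW", "R1", "R2", "OAR"]` and the new path `["GW", "R1", "R3", "NAR"]`, the crossover router is `R1`; only `R1`, `R3` and `NAR` need new reservation state, and the state in `R2` and `OAR` can be released explicitly.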

However, the loosely coupled approach requires additional complexity within the
intermediate network nodes to support the interception and generation of RSVP
messages when the router is acting as the crossover node. Another disadvantage is
that bursts of RSVP signalling messages are generated after handover to install
multiple reservations. This does not happen in the de-coupled case, because the
reservation signalling messages are generated when refresh timers expire, not by the same
triggering event.

In the proxy agent architectures, the loosely coupled approach ensures that the
reservation is not installed until the registration information generated by the MN has
propagated across the network. In MANET-based schemes and the per-host
forwarding schemes, the loosely coupled approach ensures that the new routing
information has been distributed into the network before attempting to install the
reservation. The reservation is installed in the network as soon as the route to the MN
is stable, without having to wait until the next timeout to send QoS messages.

4.2.3 Closely Coupled 

The closely coupled approach uses the same signalling mechanism to propagate the
mobility and QoS information, either as an extension to the QoS/MM signalling protocol
or via a unique QoS-routing protocol. This approach minimizes the disruption to traffic
streams after handover by installing routing and QoS information simultaneously in a
localised area, so that the reservation is in place as soon as possible after
handover. It also provides a means to install multiple reservations using one signalling
message. The reservation along the old path can also be explicitly removed.

In the proxy agent architectures, support for the opaque transport of QoS infor- 
mation in the registration messages is provided, and is interpreted by the mobility 
agents. This allows the MN to choose a mobility agent based on the available re- 
sources and provides a degree of traffic engineering within the network. In the 
MANET-based and per-host forwarding schemes, the messages that install the 
host-specific routing information in the network also transparently carry opaque QoS 
information. The reservations are installed at the same time as the routing informa- 
tion, minimizing the disruption to the traffic flows. 

4.2.4 Comparison of Approaches 

Coupling reservations with micro-mobility mechanisms allows reservation set up
delays to be minimised and packet loss to be reduced. Reservations along the new path
can be installed faster because QoS messages can be generated as soon as the new route
is established, reducing the disruption to the data flows. Scalability and overhead
are also improved because fewer update messages are sent, or they are localised to only
the affected areas of the network. Moreover, coupling ensures that the request
for a QoS reservation only occurs when there are valid routes to the MN in the
network.

The closely coupled approach requires support from particular micro-mobility 
mechanisms so that the opaque QoS information can be conveyed across the network. 
This has the consequence that the QoS implementation will be specific to a particular 
micro-mobility mechanism, and extensions to the micro-mobility protocol may be 
needed to support the required functionality. However, the closely coupled approach
maintains consistency between the reservation and the routing information within the
network, and can reduce the amount of signalling required to set up multiple
reservations.

The choice between the loosely coupled approach and the closely
coupled approach is a trade-off between a QoS solution that is tied to a
micro-mobility protocol and the performance advantage close coupling provides. The
closely coupled approach potentially provides improvements in performance and
efficiency, but at the expense of additional complexity and loss of independence from
the underlying micro-mobility mechanism.



4.3 Advance Reservations 

The mobile host may experience wide variations of quality of service due to mobility. 
When a mobile host performs a handover, the AR in the new cell must take responsi- 
bility for allocating sufficient resources in the cell to maintain the QoS requested (if 
any) by the node. If sufficient resources are not allocated, the QoS needs may not be 
met, which in turn may result in premature termination of connections. 

It is clear that when a node requests some QoS it is requesting it for the entire con- 
nection time, regardless of whether it is suffering handoffs or not. The currently pro- 
posed reservation protocol in the Internet, RSVP, implements so-called immediate 
reservations, which are requested and granted just when the resources are actually 
needed. This method is not adequate to make guaranteed reservations for mobile 
hosts. To obtain mobility independent service guarantees a mobile host needs to make 
advance resource reservations at the multiple locations it may possibly visit during 
the lifetime of the connection. 

There are a number of proposals for advanced reservations in the Internet
community that can be classified into two groups, depending on the techniques they use:

• Admission control priority 

• Explicit advanced reservation signalling 

Those groups are not necessarily distinct, as both approaches could be used to- 
gether. Admission control strategies are transparent to the mechanism using explicit 
advanced reservations, other than when a request is rejected. 

4.3.1 Admission Control Priority 

It is widely accepted that a wireless network must give higher priority to a handover
connection request than to new connection requests. Terminating an established
connection from a node that has just arrived in the cell is less desirable than rejecting a
new connection request. Admission control priority based mechanisms rely on this
principle to give priority to handover requests in admission control without
significantly affecting new connection requests.

The basic idea of these admission control strategies is to reserve resources in each
cell to deal with future handover requests. The key here is to effectively calculate the
amount of bandwidth to be reserved based on the effective bandwidth [EM93] of all
active connections in a cell and the effective bandwidth of a new connection request.

There are a number of different strategies to do this: 

• Fixed strategy. One simple strategy is to reserve a fixed percentage of the AR's 
capacity for handover connections. If this percentage is high, adequate capacity 
will most likely be available to maintain the QoS needs of handover connections, 
but at the expense of rejecting new connections. 

• Static strategy: the threshold values are based on the effective bandwidths of the
connection requests. A fraction of bandwidth is reserved for each
possible traffic class. This fraction may be calculated from historic traffic
information available to the AR.

• Dynamic strategy: each AR dynamically adapts the capacity reserved for dealing
with handover requests based on connections in the neighbouring cells. This will
enable the AR to approximately reserve the actual amount of resources needed
for handover requests and thereby accept more new connection requests than
in a fixed scheme. Such dynamic strategies are proposed and evaluated
in [NS96] and [YL97].

• Advanced Dynamic Strategy: this strategy assumes an analytical model where 
handover requests may differ in the amount of resources they need to meet their 
QoS requirements, and therefore it is more suitable for multimedia applications. 
A proposal for this strategy is described in [RSAK99]. 

This kind of admission control strategy can be used for statistical admission control,
such as that performed for QoS provision without hard guarantees, for example some
DiffServ PHBs or the Controlled Load service of the IntServ model. It is not sufficient
to provide hard guarantees on all paths followed by a mobile node.
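
A minimal sketch of the fixed and dynamic strategies is given below. It is not taken from [NS96], [YL97] or [RSAK99]; the function names, the guard fraction and the per-neighbour handover probabilities are hypothetical and serve only to illustrate the idea of holding back capacity for handovers.

```python
def admit(request_kbps, is_handover, used_kbps, capacity_kbps, guard_fraction):
    """Fixed-strategy admission: new calls may not touch the guard band."""
    if is_handover:
        limit = capacity_kbps
    else:
        limit = capacity_kbps * (1.0 - guard_fraction)
    return used_kbps + request_kbps <= limit


def dynamic_guard_fraction(neighbour_load_kbps, handover_prob, capacity_kbps):
    """Dynamic-strategy sketch: size the guard band from neighbouring cells.

    The expected incoming handover load is the neighbours' load weighted by a
    hypothetical per-neighbour handover probability, capped at full capacity.
    """
    expected = sum(load * handover_prob[cell]
                   for cell, load in neighbour_load_kbps.items())
    return min(1.0, expected / capacity_kbps)
```

With a capacity of 1000 kbit/s, 850 kbit/s in use and a guard fraction of 0.2, a 100 kbit/s handover request is admitted while a new connection of the same size is rejected. The dynamic variant replaces the constant fraction by an estimate that follows the load in the neighbouring cells.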

4.3.2 Explicit Advanced Signalling 

Admission Control strategies are not enough to accommodate both mobile hosts that 
can tolerate variations in QoS and also those that want mobility independent service 
guarantees in the same network. To obtain good service guarantees in a mobile envi- 
ronment, the mobile host makes resource reservations at all the locations it may visit 
during the lifetime of the connection. These are known as advanced reservations. 

There are a number of different approaches for advanced reservation in the
literature. We present here two of the most relevant: one supporting Integrated Services
(MRSVP [TBA98]) and the other supporting Differentiated Services (the ITSUMO
approach [Chen00]).

MRSVP 

Mobile RSVP introduces three service classes to which a mobile user may subscribe:
Mobility Independent Guarantees (MIG), in which a mobile user will receive
guaranteed service; Mobility Independent Predictive (MIP), in which the service received is
predictive; and Mobility Dependent Predictive (MDP), in which the service is
predictive with high probability.




Fig. 4. MRSVP advanced reservations. Legend: ► passive reservation (depending on mobility spec).

MRSVP allows the mobile node to make advance resource reservations along the
data flow paths to and from the locations it may visit during the lifetime of the
connection. These are specified in the Mobility Specification (MSPEC) as shown in
figure 4. The advance determination of the set of locations to be visited by a mobile
node is an important research problem, although several mechanisms have been
proposed by which the network can determine them approximately.

Two types of reservations are supported in MRSVP: active and passive. A mobile
sender makes an active reservation from its current location and it makes passive
reservations from the other locations in its MSPEC. To improve the utilization of the
links, the bandwidth of passive reservations of a flow can be used by other flows
requiring weaker QoS guarantees or best effort service. However, when a passive
reservation becomes active (i.e. when the flow of the mobile node that made the passive
reservation moves into that link), these flows may be affected.
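
The interplay of active and passive reservations on one link can be sketched as follows. This is an illustration under our own assumptions, not MRSVP's actual message processing: the `Link` class and its methods are hypothetical, and rates are plain numbers in kbit/s.

```python
class Link:
    """Link carrying active reservations, passive reservations and best effort."""

    def __init__(self, capacity_kbps):
        self.capacity_kbps = capacity_kbps
        self.active = {}   # flow id -> reserved rate in use
        self.passive = {}  # flow id -> rate held for a possible arrival

    def reserve(self, flow_id, rate_kbps, active):
        """Admit a reservation only if active plus passive demand fits."""
        committed = sum(self.active.values()) + sum(self.passive.values())
        if committed + rate_kbps > self.capacity_kbps:
            return False
        (self.active if active else self.passive)[flow_id] = rate_kbps
        return True

    def best_effort_kbps(self):
        """Passive bandwidth is lent out; only active reservations block it."""
        return self.capacity_kbps - sum(self.active.values())

    def activate(self, flow_id):
        """The MN arrived on this link: its passive reservation becomes active."""
        self.active[flow_id] = self.passive.pop(flow_id)
```

The drop in `best_effort_kbps()` after `activate()` is exactly the effect noted in the text: traffic that borrowed the passive bandwidth loses it as soon as the mobile node arrives.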

ITSUMO Approach 

The ITSUMO approach has a different philosophy on advanced reservations.
Although the mobile node itself has to explicitly request a reservation and specify a
mobility profile, the advanced reservation is ‘made’ by the QoS Global Server (QGS) on
its behalf. Based on the local information and the mobility pattern, possibly negotiated
in the SLS, the QGS estimates how much bandwidth should be reserved in each QLN
(QoS Local Node). The QGS then periodically updates the QLNs likely to be visited
by the MN. Rather than actively reserving resources in each of the access points, in this
scheme it is likely that either a passive reservation (utilized for best effort traffic) or
a "handover guard band" could be used.

The clear difference from the previous approach is that advanced reservation in
MRSVP has to be signalled by the mobile node explicitly to every station according
to its mobility pattern. This mobility pattern is known and processed by the MN itself.
In the ITSUMO approach this information is updated periodically by the QGS, according
to the mobility pattern reported by the MN but processed on the QGS. It could
therefore be said that the MN delegates the explicit advanced reservation to the QGS
(figure 5).




Fig. 5. ITSUMO advanced reservations (figure label: advanced reservation by QGS)



4.4 Pre-handover Negotiations 

Pre-handover negotiations tie the change to a new cell to the actual resource
availability in the new cell, in contrast to advance reservation schemes. When the
network or the mobile node deems that a handover should occur, the access router can 
request some indication of resource availability from neighbouring access routers. 

This needs support from the convergence layer between the IP layer and the link
layer. The link layer would need to communicate the overall resource availability of
an access point in order to let the IP layer make a decision about a possible
handover. An indication of a forthcoming handover is also needed.
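
The decision step can be sketched as below, assuming that each neighbouring access router reports a signal quality and its free capacity. The function name, the tuple layout and the selection rule (best signal among the routers that can carry the flow) are our own assumptions rather than part of any proposed protocol.

```python
def choose_target(candidates, required_kbps):
    """Pick the neighbouring access router for a pending handover.

    `candidates` maps an AR name to a (signal_quality, free_kbps) tuple, where
    free_kbps is the availability reported via the link layer. ARs that cannot
    carry the flow are skipped; the best remaining signal wins.
    """
    feasible = {ar: (sig, free) for ar, (sig, free) in candidates.items()
                if free >= required_kbps}
    if not feasible:
        return None  # no neighbour can honour the QoS; renegotiate or degrade
    return max(feasible, key=lambda ar: feasible[ar][0])
```

A return value of `None` signals that no neighbour currently has the resources, in which case the application would have to adapt or the QoS would have to be renegotiated.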

Initially, context transfer would enhance handovers between access routers,
allowing access routers to communicate, directly or through the MN, the QoS and other
contexts of a moving MN. A further refinement to the scheme would allow both
access routers and gateways to communicate the mobile’s context during a handover.
This would reduce the time during which the mobile has no specific
resources allocated to it.



4.5 Solutions in Third Generation Mobile Communication Systems 

The currently evolving design of the third generation mobile communication systems 
(3G systems) aims to provide real-time multimedia services in wide area cellular 
networks [Walk99]. These systems will include a packet switched backbone (PSB) to 
carry data traffic in the form of IP datagrams, in addition to the traditional circuit 
switching for voice calls. As the standardization of 3G systems evolves, more and 
more IETF protocols are incorporated into the architecture. UMTS Release 2000 
considers the PSB as an IP backbone using the same protocols as IP fixed networks, 
while the Radio Access Network (RAN) will use proprietary protocols. For the IP- 
based data transmission, this RAN is seen as a link layer. 

Mobility management and the provision of QoS in 3G systems are still different 
from IP based fixed networks. Three types of mobility are considered in 3G systems: 
terminal, personal and service mobility. Service mobility provides the same set of 
services regardless of the current point of attachment to the 3G network. Personal 
mobility allows users to receive their personalized service independent of their loca- 
tion in the network. Terminal mobility across different operators is a key requirement 
in 3G systems. To this end, the support of Mobile IP is being considered with some 
proposed extensions [Das00]. In essence, the Internet Gateway Serving Node (IGSN)
will act as Foreign Agent supporting macro mobility, while the movements of the 
terminal inside the Universal Terrestrial Radio Access (UTRA) are not visible outside 
the 3G network. The provision of QoS in 3G systems will incorporate two new fea- 
tures with respect to 2G systems and their evolutions: support for user/application 
negotiation of UMTS bearer characteristics and standardized mapping from UMTS 
bearer services to core network QoS mechanisms. 



5 Conclusion 

In this paper we discussed problems related to mobility and QoS. We deduced that the 
main problem in this field is following the movement of the mobile host fast enough 
to minimize the disruption caused to the QoS received by the application traffic 
flows. Also the depth of the handover signalling and the related QoS control affect 
the service outcome. 

In Section 4 we studied solutions for the interoperability of mobility and QoS. We 
presented several schemes that provide parts of a total solution to mobile QoS. We 
discussed performing strict flow shaping at the network edge, coupling of micro- 
mobility and QoS protocols, advanced reservations, pre-handover negotiations and 
context transfer, and the 3G approaches.

It has become apparent that even though there exist several good partial solutions,
we still need adaptive applications. Handovers, for example, still cause some
disturbance to data streams. RTP can contribute to this adaptability. The whole notion of
end-to-end QoS still seems very distant. It is possible to provide adequate service to
mobile hosts in a private access network, but when the correspondent node is behind
some wider public network, keeping the promised QoS becomes harder.

A new IETF Working Group, Seamoby, is aiming to provide seamless mobility 
across access routers and even domains. The work of this group will hopefully lead to 
better mobility support, especially for the problematic multimedia streams. Part of the 
work done is on context transfer issues.



Abstract. This paper introduces a new methodology for analyzing and
interpreting QoS values collected by active measurement and associated
with an a priori constructive model. The originality of this solution is
that we start with observed performance (or QoS) measures and derive
the inputs that have led to these observations.

This process is illustrated in the context of the modeling of the loss 
observed in an Internet path. 

It provides a powerful solution to address the complex problem of QoS 
estimation and network modeling. 



1 Introduction 

Much research effort has been spent during the last decade on modeling and
analyzing the performance of IP networks. This work has contributed to the design
of mechanisms aiming to enforce a level of Quality of Service. However, work
focusing on the measurement and prediction of the QoS as offered in the real Internet
is quite new. Modeling the QoS in the real Internet is a complex activity as QoS is
strongly related to the traffic offered to the network. As traffic is highly
fluctuating because of stochastic variations in the number of users and their demands,
the relation between QoS and traffic is not straightforward, and stochastic
modeling should be applied. Two main classes of approach have been used to address
this problem: the constructive approach and the descriptive approach.

The constructive approach has been widely used for many years to model
systems in general. It is based on the derivation of a model that ideally
produces the same output as the system for an outside observer. These models
provide a description of network elements that is as close as possible to the
real network. The network is described as a combination of queues, routers,
etc., and a scenario defining the parameters of the system in terms of arrival
process, capacity, buffer space, etc. The modeling phase is followed by the
resolution phase, which can rely either on simulation or on analytical analysis to
derive the QoS metrics for a given set of parameters. This approach is widely
adopted in performance analysis and queueing theory. The widespread
use of ns [?] as a simulation tool for complex networks has made possible very
precise and detailed modeling of network elements and their analysis with this
approach. The constructive approach has the nice property of directly relating the
QoS as perceived by end users to operational traffic engineering parameters that
can be controlled by network operators. It can also answer “what if” questions,
which arise when one wants to evaluate the impact of changes in network parameters
or architecture on the performance of the system under study. Nevertheless, this
approach suffers from a major drawback: the assumptions made about the structure of
the network and the scenario are so strong that its results are very unlikely to
generalize to the real Internet. This greatly restricts
the field of application of the constructive approach for modeling and evaluating the
QoS in the Internet.

On the other hand, the descriptive approach is based on measurements of QoS
in operational networks. It models the QoS as seen in the real world by describing it
with statistical parameters such as moments of different orders (mean, variance,
autocorrelation, Hurst parameter, etc.). In this approach the network is seen as
an opaque black box without any access to its internal structure. The descriptive
approach only describes the measured QoS without trying to explain the
mechanism generating the observations. This process mainly aims at predicting
the QoS experienced by applications under some reproducibility or stationarity
assumptions. However, as these models do not integrate the mechanism
generating the observations, they cannot help predict what will happen if these
stationarity assumptions no longer hold, because of a change in network
architecture or more simply because of a change in traffic parameters. This means
that this approach cannot be used for network dimensioning, capacity planning
or predicting the QoS improvement resulting from changes in network
parameters. This also means that it is not possible to interpret active measurement
results in terms of traffic engineering parameters. These remarks clearly narrow
the application of this approach to situations where the stationarity assumptions
are valid.
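
As an illustration of the descriptive approach, the following sketch summarises a measured series by its mean, variance and autocorrelation at a few lags. The function name and the sample loss trace are hypothetical; the Hurst parameter is left out because estimating it needs considerably longer traces.

```python
from statistics import fmean


def describe(series, max_lag=3):
    """Return mean, variance and autocorrelation at lags 1..max_lag."""
    n = len(series)
    mean = fmean(series)
    var = sum((x - mean) ** 2 for x in series) / n
    acf = []
    for lag in range(1, max_lag + 1):
        cov = sum((series[t] - mean) * (series[t + lag] - mean)
                  for t in range(n - lag)) / n
        acf.append(cov / var if var > 0 else 0.0)
    return mean, var, acf


# Hypothetical loss indicator per probe (1 = lost), with a visible burst.
losses = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
mean_loss, var_loss, acf_loss = describe(losses)
```

The positive lag-1 autocorrelation of this trace reflects the burst of consecutive losses, but the summary says nothing about which buffer or link caused it, which is the limitation discussed in the text.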

In this paper, we try to reconcile these two classes of approaches. We propose
a methodology for modeling QoS as measured in real networks, based on an a
priori model structure. In this new approach we first choose an a priori model
structure based on a constructive approach and we suppose that the parameters
of this model are unknown. Then, descriptive methods are used to calibrate the
unknown parameters to QoS as measured on a real network. We will show that
this methodology can help alleviate some of the concerns expressed before
about constructive and descriptive approaches.
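
The idea of calibrating an a priori model to a measurement can be illustrated with a deliberately simple stand-in. The sketch below assumes an M/M/1/K queue as the model structure and recovers the unknown offered load from a measured loss rate by bisection. This is not the Expectation-Maximization procedure used in this paper; the choice of queue, the buffer size `k` and the function names are assumptions made for illustration.

```python
def mm1k_loss(rho, k):
    """Blocking probability of an M/M/1/K queue with offered load rho."""
    if abs(rho - 1.0) < 1e-12:
        return 1.0 / (k + 1)
    return (1.0 - rho) * rho ** k / (1.0 - rho ** (k + 1))


def calibrate_load(measured_loss, k, lo=1e-6, hi=10.0, iters=100):
    """Find rho whose model loss matches the measurement, by bisection.

    Works because the M/M/1/K loss probability increases with rho.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mm1k_loss(mid, k) < measured_loss:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Once the load has been recovered, the same model can be re-evaluated with a different buffer size or capacity, which is the kind of "what if" question a purely descriptive summary cannot answer.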

In Section 2, active and passive measurements are introduced and discussed.
Then, in Section 3, the new modeling framework is described. Hidden parameter
estimation and the Expectation-Maximization (EM) method are presented and
finally an application of the methodology for modeling packet losses is provided.
Finally, we conclude.



2 Measurement of QoS over Internet



Measurement campaigns over the Internet have been widely carried out in recent years
[?]. Broadly, two general classes of measurement have been applied: active and
passive measurement. In this section we describe the specifics of each class
and introduce the measurements needed for QoS estimation and modeling.



2.1 Passive Measurements 

Passive measurements are in essence non-intrusive. In this class of
measurements, traffic parameters are monitored at a particular point of the network
such as a router or a Point of Presence (POP). This monitoring can be done at
the microscopic level, by analyzing the traffic at packet level [?], as well as at
the macroscopic level, where aggregate metrics such as traffic per flow, or
throughput, are measured. Nevertheless, passive measurements are hardly applicable for
end-to-end QoS prediction as they remain local to the point of measurement.

Microscopic passive measurements proceed by storing the header of each
observed packet at the monitored access point. This class of measurement generates
huge volumes of data, which should be processed off-line. Packet headers are used
for reconstructing all flows crossing the monitored point at each instant of time.
These measurements lead to very valuable information about the arrival
distribution and the dynamic behavior of application flows in the network. These
results are essential for constructive analysis as they provide realistic scenarios for
it.

Modeling is important in the field of microscopic passive measurement as 
it can be seen as an “Occam’s razor”, describing in a compressed and concise 
manner a huge measurement trace. 

Macroscopic passive measurements have been standardized by the RTFM
group of the IETF [^, and are frequently used in network management. These measurements
usually monitor the overall traffic over an aggregation time scale, or
the traffic per flow crossing the monitoring point. However, it is clear that
interpreting the measurements requires a priori knowledge about the measured phenomena, which is
provided by modeling. More specifically, the interpretation of macroscopic passive
measurements is closely related to the scale of aggregation. This problem is
illustrated by the different definitions of a flow found in the literature HS|.
Macroscopic passive measurements can therefore also be treated with the
methodology developed in this paper for QoS modeling and interpretation.

2.2 Active Measurements 

Active measurements are more intrusive, as they inject traffic into the Internet.
The rationale behind active measurement is that the end-to-end QoS
perceived by a real application can only be estimated by putting oneself in the place of that
application. In this approach, a probe sending process injects probe packets
into the network. At the other end of the network, a measurement agent records
some metrics on each received probe packet. The collected metrics are used to
infer the QoS that will be seen by other packet flows crossing the network.
The IPPM group of the IETF has defined different end-to-end performance metrics
mi to be collected on probe packets. Three main types of information are
extracted from the received probe packet flow: packet size, the packet loss process
and the packet delay process. They are used to derive more complex metrics such as
goodput, loss rate, loss run length, jitter, etc. The probe packets are usually sent using
ICMP (in ping surveys) or UDP. ICMP probing has become more difficult because of
the issues related to the Ping of Death attack. Many active measurement
surveys have been produced in recent years I9I2I4I2I2()I21I8I and some
measurement infrastructures have been deployed 1101191 .

Because the clocks of senders and receivers are not synchronized, the reliable
measurement of packet delay is difficult. Strict synchronization of two entities
connected by a varying-delay link can prove impossible without access to
an external universal time reference, such as that provided by GPS (Global Positioning
System) jSj- In |E], complex mechanisms that converge asymptotically
to the synchronization of two clocks are developed. However, GPS acquisition
cards are now increasingly used, making delay measurement with a resolution of around
1 µsec feasible.

Active measurements are the source of some interesting and challenging problems.
The first problem is the effect of the measurement probe traffic on
the network state. Probe traffic itself loads the network and alters the QoS
it offers. Precise QoS measurement using active probing therefore needs
compensation for the effect of the measurement traffic. Deriving a compensation
methodology requires a priori models describing the interaction between the
measurement traffic and the background traffic. In actual practice, the volume
of active measurement traffic is chosen so low that its effect on network traffic
can be neglected. This approach is not always viable, as attaining a specific
confidence level in the estimation of a QoS parameter may require a higher volume
of probing traffic. We return to this problem later in the paper. Another crucial
problem is the estimation procedure, which is also addressed later. The
solution to these crucial questions should motivate an active QoS measurement
methodology that contains the measurement procedure (measurement traffic
inter-arrival pdf, probe traffic rate, etc.) as well as the estimation procedure and
confidence level (QoS estimator, compensation procedure, etc.).

We will not deal with this important and hard problem in this paper. Our
less ambitious objective is to present a framework for analyzing and modeling
the QoS obtained by active measurement techniques. This framework is also
applicable to the calibration of active measurements, as well as to a wide spectrum
of modeling problems in networking.



3 Estimation and Prediction of QoS Based 
on Active Measurement 

In the sequel we suppose that measurement probe packets are not fragmented
during their journey through the network, and we also assume that packets are
either dropped somewhere in the network or received error-free at the other end
after a transit delay. These assumptions are reasonable if probe
packet sizes are not too large and if we remain on the wired Internet. One can
formalize the effect of network traffic on every packet of an application in the
following general framework: the network effect is modelled as a valve
followed by a random delay element (Fig. 1). The valve is controlled by an on/off
continuous-time stochastic process $S(t)$ representing the fate (being dropped or
passing) of a packet sent at time $t$, and the traversal delay is represented by the
continuous-time stochastic process $D(t)$. Now suppose that an application generates
packet $i$ at time $T_i$. This packet will experience the valve in state $S(T_i)$
and a delay $D(T_i)$. These two processes are representative of the quality of service
offered by the network to any application.



Fig. 1. Formal model of the network.



This formal model is quite general, as the two processes $S(t)$ and $D(t)$ are
not assumed to have any specific properties. Modeling the Quality of Service in
the network consists of specifying models and properties for these processes.

In the context of QoS evaluation based on active measurement, each active
probe packet provides a sample $S(T_i)$ of the on/off process (the sampled loss
process), and packets that succeed in crossing the network also provide a sample of
the delay process $D(T_i)$. The descriptive approach to QoS evaluation attempts
to estimate statistical characteristics of the continuous-time processes $S(t)$ and
$D(t)$ based on the discrete-time sampled processes $S(T_i)$ and $D(T_i)$. This
is the general framework of measurement-based modeling of Quality of
Service in the Internet. It is not possible to go further in the analysis without
specifying some assumptions on these processes. Moreover, to be able to interpret
the observed QoS, we need to relate the two processes $S(t)$ and $D(t)$ to
the traffic inside the network. Our proposed methodological approach consists in
defining a priori classes of models relating the traffic in the network to these
processes. We first provide some details about the descriptive approach. In
the sequel, we focus on the descriptive analysis of the on/off process $S(t)$.

In the descriptive approach, we want to estimate statistical characteristics of
the process $S(t)$ based on the discrete-time sampled processes $S(T_i)$ and $D(T_i)$. As
a first analysis, we assume that the processes $S(t)$ and $D(t)$ have
reached a stable and stationary state. In most cases, strict-sense stationarity is
not needed and finite-order wide-sense stationarity is sufficient, meaning that the $k$-th
order statistical moments of the processes $S(t)$ and $D(t)$ are unchanged under
translation in time. The stationary moments are obviously a function of the competing
Internet traffic as well as of the measurement probe traffic.

We assume that active measurement has resulted in a loss trace containing
$K$ samples $\{S(T_i)\}$, $i = 1, \dots, K$, where the sampling times $T_i$, $i = 1, \dots, K$,
are chosen following an iid renewal process with lifetime distribution
$F(t) = \operatorname{Prob}\{T_{k+1} - T_k < t\}$. Under ergodicity and stationarity assumptions for $S(t)$, it is
possible to estimate the statistical characteristics of $S(t)$ based on the samples.
For example, the temporal mean

$$\bar{S} = \frac{1}{K} \sum_{i=1}^{K} S(T_i)$$

is an unbiased estimator of the mean of $S(t)$, $\mu = E\{S(T)\}$, which is also the loss probability. The variance
of this estimator is

$$\operatorname{var}\{\bar{S}\} = \int R(\tau)\, dF(\tau),$$

where $R(t) = E\{(S(T + t) - \mu)(S(T) - \mu)\}$ is the autocorrelation of $S(t)$.
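
As an illustration of the temporal mean estimator, the following minimal Python sketch samples a toy on/off loss process at renewal instants and compares the estimate with the known stationary loss probability. The two-state Markov on/off process, its rates, the exponential gaps between probes and all function names are assumptions made for this example; they are not taken from the measurements discussed in this paper.

```python
import bisect
import random

def simulate_onoff(rate_to_loss, rate_to_pass, horizon, rng):
    """Switching instants of a toy two-state Markov on/off process S(t).

    S(t) starts at 0 (passing) and toggles at each returned instant.
    """
    t, state, switches = 0.0, 0, []
    while True:
        rate = rate_to_loss if state == 0 else rate_to_pass
        t += rng.expovariate(rate)
        if t >= horizon:
            return switches
        switches.append(t)
        state = 1 - state

def loss_state(switches, t):
    """S(t): 1 if a packet sent at time t is dropped, 0 if it passes."""
    return bisect.bisect_right(switches, t) % 2

def temporal_mean(samples):
    """Temporal mean of the sampled loss process."""
    return sum(samples) / len(samples)

rng = random.Random(1)
HORIZON = 20000.0
switches = simulate_onoff(0.25, 1.0, HORIZON, rng)

# Probe instants T_i with iid exponential gaps (one possible renewal process).
samples, t = [], rng.expovariate(2.0)
while t < HORIZON:
    samples.append(loss_state(switches, t))
    t += rng.expovariate(2.0)

estimate = temporal_mean(samples)
stationary_loss = 0.25 / (0.25 + 1.0)  # loss probability of the toy process
print(f"estimate={estimate:.3f}  stationary={stationary_loss:.3f}")
```

With a long enough trace the estimate settles close to the stationary value; how quickly it does so depends on the correlation of the on/off process, which is the point made by the variance expression above.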

If delay samples are also available, one can extend the above model to take them
into account. The idea is to use the delay information to reduce the estimation
variance of the loss probability. However, recent empirical studies have shown
that delay and loss rates are statistically independent m. This is mainly due to
the fact that losses and delays do not occur at the same location in the network.
Based on this empirical observation, the joint estimation of the statistical
parameters of $S(t)$ and $D(t)$ can be split into two independent estimation
problems, one for the loss process and the other for the delay.

Estimates made on the probing traffic can be extended to other competing
flows. This follows from the previous analysis: all competing flows are governed
by the same open/close process; therefore, under the stationarity assumption
for $S(t)$, the temporal mean estimators of all flows converge to the
same value. However, the variance of this estimator depends largely on the
autocorrelation function of the open/close process and on the dynamics of the
particular flow. For example, TCP flows, which send a burst of packets when the window
opens, will experience a higher estimator variance than competing UDP flows that
send packets more regularly. This does not mean that a UDP flow will
see a lower loss rate than a TCP flow; it only says that if a UDP and a TCP flow
are competing, the TCP flow may see larger fluctuations of its loss rate than
the more regularly spaced UDP flow.

This conclusion might seem counter-intuitive. The assumption made in deriving
the result should be stressed: we have assumed that the processes $S(t)$ and
$D(t)$ have reached a stable and stationary state. This state depends on the
background traffic as much as on the application (or measurement probing)
traffic. The stationary state reached might be different if TCP traffic is sent
instead of UDP traffic (and vice versa). This restores the intuition that a reactive
TCP flow should see a lower loss rate than a non-reactive UDP one.

This comment emphasizes the importance of using a model relating the traffic
in the network to the stationary state of $S(t)$ and $D(t)$. In the following we
deal with this problem. As stated before, it is not possible to go further than
a simple descriptive analysis without some a priori knowledge of the mechanism generating the
on/off and delay processes. This is where constructive analysis
should be introduced.

The relationship between the traffic flowing into the network and the two
processes $S(t)$ and $D(t)$ can be described by a constructive model defining precisely
(or even roughly) the interactions between different flows that result in the QoS
observed at the output of the network. The estimation goal is to choose the
undetermined parameters of the constructive model such that an optimization
criterion (Maximum Likelihood or Least Mean Squared error) is satisfied. In
this way, the best set of parameters describing the measured QoS under the
assumption of the a priori constructive model is derived.

It is clear that not all choices of the a priori constructive model are equivalent.
Too simplistic an a priori model will not be able to describe the measured
QoS faithfully. At the other extreme, too sophisticated a priori models are intractable or
too complex to calibrate. Moreover, as we are in a stochastic context, no model
will be perfect. To address this issue, we use a paradigm borrowed from
the field of data compression. In data compression, a dataset is represented by
a predictive model and a sequence of error terms. Each term in the dataset is
divided into two parts: one obtained from the predictive model, which can be discarded,
and an error term, which is stored as is. A stronger model leads to smaller error
terms and fewer bits needed to encode them. We can use a similar paradigm for
our a priori measurement-based modeling of Quality of Service. The measured
QoS is a dataset that is to be represented by the predictive a priori model
and a sequence of error terms that are represented by a strictly descriptive
model. This paradigm can be extended hierarchically if the error terms are
themselves represented by a new level of a priori model followed by a simpler
and smaller descriptive model of the error term. Under this paradigm, a good a priori
model is one that leads to tractable parameter calibration and, at the
same time, to a small and simple descriptive model for the error term.

The proposed methodological approach reduces to the following modeling
steps. First, an a priori constructive model is chosen, with some input parameters
left undetermined. Second, the QoS measured by active measurement
is applied, and the unknown parameters of the constructive model are
estimated by optimizing some criterion. We will see a concrete application
of this approach in Section 5.

The main characteristic, and difficulty, of this approach is related to the hidden
variables. The unknown parameters of the constructive model are not directly
observed at the output of the network. The estimation procedure tries to find
the most faithful values of the unknown (and unobservable) parameters. This
estimation procedure is described in the next section (Section 4).

The hidden variable approach is not as counter-intuitive as it may seem. Reference to the network
state is common in the context of QoS evaluation in networking. However,
the network state is not a concrete and well-defined notion. In fact, the network
state is an abstract variable, representative of the effect of all concurrent flows
on one application flow. As applications have no direct access to information on
router loads and characteristics, the network state is a hidden variable that can
be perceived by an application only through its effects on its data flow.



4 Hidden Variable Estimation
and the Expectation-Maximisation Algorithm

The general statistical framework of the proposed methodology can be easily
described. Let $X = \{X_i\}$ be a sequence of samples of the two processes $S(t)$
and/or $D(t)$ at times $\{T_i\}$. Now let $\mathcal{M}(\theta)$ represent the a priori model with
unknown parameters $\theta$. This a priori constructive model relates the observed
QoS $X$ to some unobserved parameters (for example input traffic or network
parameters), represented by $\theta$, by way of a random function, $X = \mathcal{F}(\mathcal{M}(\theta))$.
The objective is to find the set of parameters $\theta$ such that an optimization
criterion is satisfied. One of the most powerful and frequently used criteria is
Maximum Likelihood. It chooses the parameters $\theta$ that maximize the probability
of seeing the observed QoS samples $x$, given that the a priori model $\mathcal{M}$
follows the parameters $\theta$:

$$\hat{\theta} = \arg\max_{\theta} \operatorname{Prob}_{\theta}\{\mathcal{F}(\mathcal{M}(\theta)) = x\} \qquad (1)$$

The estimation procedure then reduces to an optimization problem with a cost
function.

The framework presented is very general and may be difficult to handle
by analytic methods. The a priori model can contain traffic models as well as
router and link models. For a tractable analysis, we assume that we can divide
the complex a priori model into a set of simpler models, with the same
model structure $\mathcal{M}$ but with different values of the parameter $\theta$. More formally, we
assume that the a priori model can be described by a stochastic Finite State
Machine (FSM), with each state $i \in \{1, \dots, K\}$ of the FSM characterized by
a set of parameter values $\theta_i$. The observation made when the FSM is in state
$i$ is determined by a stochastic function $\mathcal{F}_i(\mathcal{M}(\theta_i))$. The sequence of states of
the a priori FSM, $Y = \{Y_j\}_{j=1}^{T}$, is unobservable and should be estimated as
a by-product of the model calibration. In stochastic modeling terms, the
previous assumption means that the observations $x$ follow a Hidden Markov Model (HMM):
the unobservable sequence of states $Y$ is a Markov chain and the observation at
time $j$ is a stochastic function of the state $Y_j$. The assumption of an a priori
FSM is not too restrictive, and a large spectrum of models is compatible with
it.

4.1 The Expectation Maximization Algorithm 

In this context, the Expectation Maximization (EM) algorithm is a valuable
approach to maximum likelihood parameter estimation. It has a number of
advantages, especially its stability: no iteration of the EM algorithm decreases
the likelihood of the model, which ensures the convergence of the algorithm to
a local, but not necessarily global, extremum of the likelihood. Another major
advantage of the EM algorithm is its numerical complexity. A direct computation
of the likelihood would require on the order of $K^T$ terms, where $K$ is the number of
states of the a priori FSM model and $T$ is the number of observations. On
the other hand, the numerical complexity of the EM algorithm is of the order of
$K^2 T$.
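
The gap between the two costs can be made concrete with a small Python sketch that computes the likelihood of a short loss trace in both ways: by summing over every state sequence, and by the forward recursion that underlies the Forward-Backward algorithm. The two-state parameters and the eight-sample trace are hypothetical values chosen for illustration, and the Bernoulli loss emission (a packet is lost with a state-dependent probability) anticipates the model used in the case study.

```python
from itertools import product

def emission(loss, state, x):
    """Probability of observing x (1 = lost, 0 = received) in a given state."""
    return loss[state] if x == 1 else 1.0 - loss[state]

def likelihood_brute_force(obs, init, trans, loss):
    """Sum the joint probability over all K**T state sequences."""
    K = len(init)
    total = 0.0
    for seq in product(range(K), repeat=len(obs)):
        p = init[seq[0]]
        for t, s in enumerate(seq):
            if t > 0:
                p *= trans[seq[t - 1]][s]
            p *= emission(loss, s, obs[t])
        total += p
    return total

def likelihood_forward(obs, init, trans, loss):
    """Forward recursion: roughly K*K*T multiplications."""
    K = len(init)
    alpha = [init[s] * emission(loss, s, obs[0]) for s in range(K)]
    for x in obs[1:]:
        alpha = [sum(alpha[r] * trans[r][s] for r in range(K))
                 * emission(loss, s, x) for s in range(K)]
    return sum(alpha)

# Hypothetical two-state model and short loss trace.
init = [0.6, 0.4]
trans = [[0.9, 0.1], [0.2, 0.8]]
loss = [0.05, 0.7]
obs = [0, 1, 1, 0, 1, 0, 0, 1]

brute = likelihood_brute_force(obs, init, trans, loss)   # 2**8 sequences
forward = likelihood_forward(obs, init, trans, loss)
print(brute, forward)
```

Both functions return the same value up to rounding, but only the second remains usable for traces of thousands of packets.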

The EM algorithm involves iteratively maximizing, with respect to $\theta$, the function

$$Q(\theta, \theta_k) = E\{L(X, Y; \theta) \mid X = x; \theta_k\} \qquad (2)$$

where $Y$ is the unobserved Markov state sequence, $X$ is the vector of observations
(a probabilistic function of $Y$), and $L(X, Y; \theta)$ denotes the log-likelihood of the
'complete data' $Z = (X, Y)$ when $\theta$ is the parameter of the model. The expectation
involved in the computation of $Q(\theta, \theta_k)$ is taken given that $X = x$ and
given that $\theta_k$ is the parameter of the model.

Y1 --> Y2 --> Y3 --> Y4 --> Y5 --> Y6 ...   Latent data
 |      |      |      |      |      |
 v      v      v      v      v      v
X1     X2     X3     X4     X5     X6 ...   Observations

Fig. 2. Dependence structure of the Hidden Markov Model.



The dependence structure of the HMM (see Fig. 2) is such that the complete
log-likelihood $L(X, Y; \theta)$ can be split into two terms,
$L(X, Y; \theta) = L(Y; \theta) + L(X \mid Y; \theta)$,
so that the optimization can be performed independently for the
parameters of the unobserved state process and for the parameters of the
observations; these optimizations are performed in spaces of smaller dimension.
In many cases there is even an analytic expression for the maximizer in $\theta$ of the
function $Q(\theta, \theta_k)$, so that the maximization step is very fast.

Each iteration of the EM algorithm can consequently be decomposed into
two steps:

1. Step E (Expectation):

Compute $Q(\theta, \theta_k) = E\{L(X, Y; \theta) \mid X = x; \theta_k\}$.

2. Step M (Maximization):

Maximize $Q(\theta, \theta_k)$ with respect to $\theta$:

$$\theta_{k+1} = \arg\max_{\theta} Q(\theta, \theta_k).$$

The maximization involved in the M step is analytical and does not require
intensive computation; the integration involved in the E step requires the computation
of a non-linear filter, which is based on the Forward-Backward (or
Baum-Welch) algorithm m.
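
The two steps can be sketched in Python for the simplest case of interest here, an HMM whose observation in each state is a Bernoulli loss indicator. This is a minimal sketch under that assumption, using the standard scaled Forward-Backward recursions; the function names, the synthetic trace and the starting values are hypothetical and are not the calibration code used for the results reported later.

```python
import math
import random

def emit(p_loss, x):
    """Probability of observing x (1 = lost, 0 = received)."""
    return p_loss if x == 1 else 1.0 - p_loss

def forward_backward(obs, init, trans, loss):
    """E step: scaled forward-backward pass.

    Returns the state posteriors gamma[t][s], the summed transition
    posteriors xi_sum[r][s] and the log-likelihood of obs.
    """
    K, T = len(init), len(obs)
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            a = [init[s] * emit(loss[s], obs[0]) for s in range(K)]
        else:
            a = [sum(alpha[-1][r] * trans[r][s] for r in range(K))
                 * emit(loss[s], obs[t]) for s in range(K)]
        c = sum(a)
        scale.append(c)
        alpha.append([v / c for v in a])
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(trans[s][r] * emit(loss[r], obs[t + 1]) * beta[t + 1][r]
                       for r in range(K)) / scale[t + 1] for s in range(K)]
    gamma = [[alpha[t][s] * beta[t][s] for s in range(K)] for t in range(T)]
    xi_sum = [[0.0] * K for _ in range(K)]
    for t in range(T - 1):
        for r in range(K):
            for s in range(K):
                xi_sum[r][s] += (alpha[t][r] * trans[r][s]
                                 * emit(loss[s], obs[t + 1])
                                 * beta[t + 1][s] / scale[t + 1])
    return gamma, xi_sum, sum(math.log(c) for c in scale)

def em_step(obs, init, trans, loss):
    """One EM iteration; the M step has a closed form for this model."""
    gamma, xi_sum, loglik = forward_backward(obs, init, trans, loss)
    K, T = len(init), len(obs)
    new_init = gamma[0][:]
    new_trans = [[v / sum(row) for v in row] for row in xi_sum]
    new_loss = [sum(gamma[t][s] for t in range(T) if obs[t] == 1)
                / sum(gamma[t][s] for t in range(T)) for s in range(K)]
    return new_init, new_trans, new_loss, loglik

# Synthetic trace drawn from a hypothetical two-state model.
rng = random.Random(0)
true_trans, true_loss = [[0.95, 0.05], [0.10, 0.90]], [0.02, 0.60]
state, obs = 0, []
for _ in range(400):
    obs.append(1 if rng.random() < true_loss[state] else 0)
    state = 0 if rng.random() < true_trans[state][0] else 1

init, trans, loss = [0.5, 0.5], [[0.8, 0.2], [0.3, 0.7]], [0.1, 0.5]
logliks = []
for _ in range(15):
    init, trans, loss, ll = em_step(obs, init, trans, loss)
    logliks.append(ll)
```

The list `logliks` records the log-likelihood before each update; it should never decrease from one iteration to the next, which is the stability property mentioned above.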

4.2 State Sequence Estimation 

Application of the EM algorithm makes it possible to calibrate the parameters
of the a priori model. The next step is to estimate the sequence of states $Y$.
This sequence helps in interpreting the observed vector $x$ with the a priori
constructive model. Two approaches can be used to reach this goal. The first one,
the Marginal Posterior Mode (MPM), attempts to estimate the most probable
state at step $t$ ($Y_t$) based on the observation process up to step $t$ ($x_0^t$). The
second approach, based on the Viterbi algorithm, estimates the most probable
state sequence given the whole set of observations.

The Marginal Posterior Mode. The Forward-Backward algorithm used in
the EM algorithm produces as a by-product the a posteriori marginal distribution
$\gamma_t(i) = \operatorname{Prob}\{Y_t = i \mid X\}$, which leads to the MPM estimate, i.e. the most
probable state at time $t$ given the observed sequence $X$. This estimate is the
maximizer in $i$ of $\gamma_t(i)$:

$$\hat{Y}_t^{MPM} = \arg\max_{1 \le i \le K} \gamma_t(i). \qquad (3)$$



The Viterbi Algorithm. Another approach to state sequence estimation is
based on the Viterbi algorithm. This algorithm produces a sequence

$$\hat{Y} = (\hat{Y}_1, \hat{Y}_2, \dots, \hat{Y}_T)$$

that is globally the most likely given the observation vector $X = x_0^T$. The estimate
produced by the Viterbi algorithm does not, in general, coincide with
the Marginal Posterior Mode at every time index $t$. In general, the sequence
produced by the Viterbi algorithm presents longer homogeneous intervals than the
sequence produced by the Marginal Posterior Mode criterion.

The a posteriori log-likelihood $L(Y \mid X = x; \theta)$ is equal to the complete
log-likelihood up to an additive constant,
$L(Y \mid X = x; \theta) = L(Y, X = x; \theta) - L(X = x; \theta)$,
so that the maximizer of $L(Y \mid X = x; \theta)$ is the maximizer of
$L(Y, X = x; \theta)$. The dependence structure of the HMM (Fig. 2) leads to an additive
expression for the complete log-likelihood:

$$L(Y, X = x; \theta) = \sum_{t}^{T} \left( \log \operatorname{Prob}\{Y_{t+1} \mid Y_t\} + \log \operatorname{Prob}\{x_{t+1} \mid Y_{t+1}\} \right).$$

This sum can be represented graphically as the length of a path in a lattice.
The complete log-likelihood $L(Y, x; \theta)$ is the total length of the path $Y$ in the
lattice. The Viterbi algorithm retrieves the longest path in this lattice.

The additive form of the criterion makes it possible to construct a dynamic
programming algorithm to solve this optimization problem m.
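
A minimal Python sketch of this dynamic program is given below for an HMM with Bernoulli loss observations. The parameter values and the short trace are hypothetical, and the sketch assumes that all probabilities are strictly positive so that their logarithms are finite.

```python
import math

def viterbi(obs, init, trans, loss):
    """Most likely state sequence for a Bernoulli-loss HMM.

    Works in the log domain: the score of a path is the sum of the log
    transition and log emission terms, and the best score is propagated
    state by state with back-pointers.
    """
    K = len(init)

    def log_emit(s, x):
        return math.log(loss[s] if x == 1 else 1.0 - loss[s])

    score = [math.log(init[s]) + log_emit(s, obs[0]) for s in range(K)]
    back = []
    for x in obs[1:]:
        prev, score, pointers = score, [], []
        for s in range(K):
            best = max(range(K), key=lambda r: prev[r] + math.log(trans[r][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans[best][s]) + log_emit(s, x))
        back.append(pointers)

    # Trace the back-pointers from the best final state.
    state = max(range(K), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical two-state model: state 0 mostly passes, state 1 mostly drops.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
loss = [0.05, 0.8]
obs = [0, 0, 1, 1, 1, 0, 0, 0]
path = viterbi(obs, init, trans, loss)
print(path)
```

The cost is again of the order of $K^2 T$, since each of the $T$ steps examines every pair of consecutive states once.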

In the previous sections we have developed the proposed modeling methodology
based on an a priori constructive model. In the following section we
apply this approach to the modeling of losses observed on an Internet
path.



5 Case Study: Modeling of Losses Observed
on an Internet Path

In this section we apply the concepts developed in the previous sections. For this purpose,
we use an observed loss trace collected from the Internet following the IPPM metrics
recommendation El. We sent over the Internet a sequence of regularly
spaced packets of equal size from Paris to different addresses in the United States
and Europe. The packets were regularly spaced, with a delay between packets
of $\Delta = 50$ msec. The loss trace $X = \{X_t\}$ is defined by $X_t = 0$ if the
packet reaches its destination and $X_t = 1$ if the packet is lost. The methodology
described in this paper was applied to a sequence of 10,000 packets (corresponding
to a period of 500 seconds with a constant inter-packet time of 50 msec).

In a recent paper EI>, the EM algorithm described above is applied to
derive a Hidden Markov Model (HMM) for the channel connecting a source to
a destination over an Internet path. This HMM 'switches' between $K$ different
states following a homogeneous and ergodic Markov chain. In each of these $K$
states, the channel is uniformly blocking or passing. This defines a probability
that a packet is lost at a time when the channel is in state $i$ ($1 \le i \le K$). It is
shown in HH, on 36 loss traces collected over the Internet, that no more than
4 states are needed to model losses with the HMM.

However, the model developed in El remains strictly descriptive, as the
states are not related to any constructive model. In this section, we use the
methodology described in this paper to extend this descriptive model with an
a priori constructive model. We do not go through the details of the
derivation steps and the evaluation of the model here; this task is left to a
companion paper IE-

Based on the descriptive HMM model developed in m and on the approach
proposed in we assume an a priori constructive model (Fig. 3) for the network,
consisting of a single bottleneck with transmission capacity $\mu$ and
buffer size $M$. This bottleneck is fed by traffic following a Markov Modulated
Poisson Process (MMPP). The MMPP traffic model describes the traffic entering
the bottleneck, which is the sum of the background Internet traffic and the
measurement probe traffic. This model assumes that the traffic switches between
$K$ different states following a continuous-time homogeneous Markov chain with
infinitesimal transition matrix $Q$. Each state $i$ represents a homogeneous Poisson
process with parameter $\lambda_i + \gamma$, $i \in \{1, \dots, K\}$, where $\gamma$ is the rate of the
measurement probe traffic and $\lambda_i$ represents the Internet background traffic.

We suppose that the capacity $\mu$ is known and constant. This assumption is
sound in practice, as this parameter can be estimated faithfully by the packet
pair (or packet train) procedure, which is well described in the literature jO). This
a priori model fits easily into the FSM framework described in
Section 4. Each state $i$ of the FSM is related to one state of the MMPP traffic
process with a parameter set $\theta_i = (Q, \lambda_i)$. The observed packet loss trace is
related to the parameter set $\theta_i$ by a stochastic function $\mathcal{F}(\mathcal{M}(\theta_i))$. In the sequel,
this function is derived intuitively. A more precise derivation is presented in a
companion paper m-



Fig. 3. A priori constructive model of the network



The elementary theory of the M/M/1/K queue shows that a flow following a Poisson
arrival process with parameter $\lambda$ and feeding a queue with finite buffer size $M$
and processing capacity $\mu$ sees a loss rate $P$ given by 0 :

$$P = \frac{(1 - \rho)\,\rho^{M}}{1 - \rho^{M+1}} \qquad (4)$$

where $\rho = \lambda / \mu$ is the load factor.

Nevertheless, if $M$ is sufficiently large (more than 10) and $\rho$ is not too small
(which is the case in real network conditions), this relation simplifies to

$$P \approx 1 - \frac{1}{\rho} \qquad (5)$$

which shows a simple relation between loss rate and load factor, independent
of the buffer size. This relation holds for $\rho > 1$; for $\rho < 1$ and large $M$, the loss rate given by Eq. (4) is close to zero.
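
The quality of the approximation can be checked numerically with a short Python sketch that evaluates the exact loss rate of Eq. (4) and the simplified relation of Eq. (5). The load factors are those reported later in Table 3; the buffer size of 100 and the function names are assumptions made for this example only.

```python
def loss_exact(rho, M):
    """Loss rate of Eq. (4) for load factor rho and buffer size M."""
    if rho == 1.0:
        return 1.0 / (M + 1)  # limit of Eq. (4) when rho tends to 1
    return (1.0 - rho) * rho**M / (1.0 - rho**(M + 1))

def loss_approx(rho):
    """Large-buffer approximation of Eq. (5), meaningful for rho > 1."""
    return 1.0 - 1.0 / rho

BUFFER_SIZE = 100                   # hypothetical buffer size
LOAD_FACTORS = [20.0, 1.2594, 1.07]

for rho in LOAD_FACTORS:
    print(f"rho={rho:<7} exact={loss_exact(rho, BUFFER_SIZE):.4f} "
          f"approx={loss_approx(rho):.4f}")

# Below saturation the exact loss rate becomes negligible for a large buffer.
light_load_loss = loss_exact(0.8, BUFFER_SIZE)
```

For load factors above 1 the two values agree closely once the buffer is large, and the agreement degrades as the load factor approaches 1 or the buffer shrinks.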

The relation given in Eq. (5) is the basis for deriving the required
function $\mathcal{F}$. If the bottleneck queue is fed by an MMPP, the behavior can
differ from that of a simple Poisson process. However, it can be shown that the
overall behavior can easily be derived using the above analysis of the
M/M/1/K queue. We assume that the mean sojourn time in each state $i$
of the MMPP is much larger than the time scale $\tau$ of the queue,
meaning that the queue has enough time to reach its stationary
distribution in each state of the MMPP before a transition occurs. The global
behavior of the queue can then be described as a mixture of the stationary behaviors
in each state. More specifically, the loss rate observed at the output of the queue
when the input traffic is in state $i$ is obtained by applying Eq. (5)
to the input load factor in this state, $\rho_i = (\lambda_i + \gamma) / \mu$.

In the loss trace studied here, measurement probes are separated
by 50 msec. This time is much larger than the time scale of the queue
as defined above. This means that, conditioned on the state of the input MMPP
process, the losses can be assumed independent, as the queue has enough time
to be completely renewed between the arrivals of two probe packets.

This means that the losses observed at the output of the a priori constructive
FSM model for the measurement traffic are related to the state $i$ of the FSM by
the following stochastic function: a probe packet is lost with probability

$$P_i = 1 - \frac{1}{\rho_i}. \qquad (6)$$

This characterizes the stochastic observation function $\mathcal{F}(\mathcal{M}(\theta_i))$, with $\theta_i = (Q, \lambda_i)$.
With this last piece of information, the a priori constructive model is completely
defined as a single bottleneck queue fed by MMPP traffic (under some
assumptions) and the measurement probe traffic. The observations are related to
the constructive model by the stochastic function $\mathcal{F}(\mathcal{M}(\theta_i))$ described above.

We must now address the calibration task. We need to find the set of values
$P_i$ (or equivalently $\rho_i$) and the infinitesimal transition matrix $Q$ that best fit an
observed loss trace.

We previously described the descriptive HMM that was developed earlier in
H3 for describing network channels. The calibration of this descriptive model uses
the EM algorithm to find a set of loss rates $p_i$ and a state transition matrix
$\Gamma$ satisfying the maximum likelihood criterion. The result of this analysis can
be used to calibrate our a priori constructive model. We clearly have $P_i = p_i$. It
remains to calibrate the infinitesimal matrix $Q$. The descriptive HMM provides
a transition matrix $\Gamma$ representing the transitions of the MMPP as seen by the
probe packets. If the probe packets are separated by $\Delta$ units of time, we have the
following relationship between $\Gamma$ and $Q$:

$$\Gamma = e^{Q \Delta}. \qquad (7)$$

This equation clearly has an infinite number of solutions, each of which
describes the observations. One of these solutions, which we use as the estimate
$\hat{Q}$ of $Q$, is calculated by diagonalizing the estimated transition matrix of the
descriptive HMM:

$$\Gamma = V D V^{-1}, \qquad \hat{Q} = \frac{1}{\Delta}\, V \log(D)\, V^{-1},$$

where $\log(D)$ is the diagonal matrix containing on its diagonal the logarithms of
the diagonal entries of $D$. This solution has the nice property of having the
same mean sojourn time in each state as the estimated HMM.
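
The relationship of Eq. (7) can be checked numerically with the Python sketch below. Note that it does not diagonalize the matrix as described above: to stay within the standard library it uses the power series of the matrix logarithm, which applies here because the transition matrix is close to the identity, and it then verifies the result by exponentiating it. The matrix is the one reported in Table 2, time is counted in probe intervals (an assumption of this example, $\Delta = 1$), and the function names are hypothetical.

```python
def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matrix_log_near_identity(G, terms=60):
    """log(G) from the series log(I + A) = A - A^2/2 + A^3/3 - ...

    Valid when A = G - I is small (norm below 1).
    """
    n = len(G)
    A = [[G[i][j] - (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    power = [row[:] for row in A]
    result = [[0.0] * n for _ in range(n)]
    for k in range(1, terms + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for i in range(n):
            for j in range(n):
                result[i][j] += sign * power[i][j] / k
        power = matmul(power, A)
    return result

def matrix_exp(Q, terms=30):
    """exp(Q) from its Taylor series."""
    n = len(Q)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in matmul(term, Q)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

GAMMA = [[0.9370, 0.0623, 0.0006],
         [0.0026, 0.9973, 0.0002],
         [0.0000, 0.0004, 0.9996]]
DELTA = 1.0  # one probe interval taken as the unit of time

Q_hat = [[v / DELTA for v in row] for row in matrix_log_near_identity(GAMMA)]
gamma_check = matrix_exp([[v * DELTA for v in row] for row in Q_hat])

for row in Q_hat:
    print(["%.4f" % v for v in row])
```

The matrix `gamma_check` reproduces `GAMMA` up to rounding, which confirms that the computed generator satisfies Eq. (7).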

Let us now apply the above procedure to a real loss trace (Table
1). This trace was measured between France and the USA, and the bottleneck capacity
at the time of measurement was 2 Mbps. The loss rates measured over
windows of 5 sec on this loss trace are depicted in Fig. 4.
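
The windowed loss rate can be computed from a binary loss trace as in the following Python sketch. With probes sent every 50 msec, a 5 sec window contains 100 probes. The 300-packet trace is a hypothetical example and the function name is an assumption of this sketch.

```python
def windowed_loss_rate(trace, packets_per_window):
    """Loss rate over consecutive, non-overlapping windows of the trace.

    trace holds one entry per probe: 1 if the probe was lost, 0 otherwise.
    A trailing incomplete window is ignored.
    """
    n_windows = len(trace) // packets_per_window
    return [sum(trace[w * packets_per_window:(w + 1) * packets_per_window])
            / packets_per_window for w in range(n_windows)]

PROBE_INTERVAL_MS = 50
WINDOW_MS = 5000
packets_per_window = WINDOW_MS // PROBE_INTERVAL_MS  # 100 probes per window

toy_trace = [0] * 150 + [1] * 50 + [0] * 100  # hypothetical 300-packet trace
rates = windowed_loss_rate(toy_trace, packets_per_window)
print(rates)
```

Applied to a 10,000-packet trace, the same function yields 100 values, one per 5 sec window.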



Table 1. Basic parameters of the observed loss trace

Interval (msec)    Mean loss rate
50                 18.58%



Fig. 4. Loss rate observed over windows of 5 sec.



The application of the HMM estimation process generates the following results
(Table 2). The stationary state distribution is also shown, as $\pi$.



Table 2. HMM parameters calibrated on the observed loss process

$p = (0.95, 0.206, 0.07)$

$$\Gamma = \begin{pmatrix} 0.9370 & 0.0623 & 0.0006 \\ 0.0026 & 0.9973 & 0.0002 \\ 0.0000 & 0.0004 & 0.9996 \end{pmatrix}$$

$\pi = (0.0267, 0.6581, 0.3152)$



Following the procedure described above, these parameters can be transformed
into the parameters of the a priori constructive model. This transformation
results in the values shown in Table 3. The stationary distribution of the MMPP
is the same as that of the HMM.



Table 3. Calibrated parameters for the a priori model

$\rho = (20, 1.2594, 1.07)$

$$\hat{Q} = \begin{pmatrix} -0.0651 & 0.0645 & 0.0006 \\ 0.0026 & -0.0028 & 0.0002 \\ 0.0001 & 0.0003 & -0.0004 \end{pmatrix}$$



This analysis shows that the observed loss process can be interpreted in the
context of the a priori model described above with this set of input parameters.
This interpretation is very helpful. It makes it possible to simulate loss traces
similar to the real trace by feeding the calibrated parameters to a queueing system
simulator. It also helps to better understand what happens inside the network.
For example, the stationary distribution shows that the load factor of the
network was as high as 20 for at least 2.6% of the time, and that about 65% of the time
the load factor was around 1.26. The estimated state transitions are shown in
Fig. 5 for 10,000 packets (500 secs). A clear regime transition between states 2 and 3
can be seen around packet 338, meaning that the load factor went from 1.259
to 1.07. The inverse transition occurs around packet 483. Some spurious transitions
between states 1 and 3, representing surges in traffic load that
completely congested the network (with a load factor around 20), can also be observed.



Fig. 5. Estimated state transitions for the measured trace (states estimated by the MPM method).

6 Conclusion and Perspectives 

In this paper, we developed a new modeling methodology for analyzing and
interpreting QoS values collected by active measurement and associated with an a
priori constructive model. This approach is the opposite of the classical constructive
modeling approach, where we start by making assumptions on the inputs and
then find the performance metrics. Here, we start with observed performance (or
QoS) measures and derive the inputs that have led to these observations.

This approach needs the introduction of the hidden variable statistical framework.
We have described this framework and given some guidelines for the EM
algorithm. It has helped in formalizing the approach in the context of the well
documented Hidden Markov Model.

Finally, we have illustrated the approach in the context of the modeling of 
the loss observed in an Internet path. This example shows that the proposed 
solution is valuable in the context of the interpretation of QoS indices collected 
by active measurements. 

Although it is more complex, this approach provides a powerful solution to 
address the complex problem of QoS estimation and network modeling. This 
methodology as well as practical case studies will be developed in the future. 
Abstract. Distributed Admission Control in IP DiffServ environments is an
emerging and promising research area. Distributed admission control solutions
share the idea that no coordination among network routers (i.e. explicit
signaling) is necessary when the decision whether to admit or reject a new
offered flow is pushed to the edge of the IP network. Proposed solutions differ
in the degree of complexity required in internal network routers, and result in
different robustness and effectiveness in controlling the accepted traffic. This
paper builds on a recently proposed distributed admission control solution,
called GRIP (Gauge&Gate Reservation with Independent Probing), designed to
integrate the flexibility and scalability advantages of a fully distributed
operation with the performance effectiveness of admission control mechanisms
based on traffic measurements. We show that, under the assumption that traffic
sources are Dual-Leaky-Bucket shaped, GRIP can provide deterministic
performance guarantees (i.e., number of accepted flows per node never greater
than a predetermined threshold). Tight QoS performance is made possible
even in impulsive load conditions (i.e., sudden activation of several flows),
thanks to the introduction of a “stack” mechanism in each network node. A
thorough performance evaluation of the conservative effects of the stack shows
that the throughput reduction brought about by this mechanism is tolerable, and
limited to about 15%.



1 Introduction 

It is well known that the IntServ approach, while allowing hard QoS guarantees,
suffers from scalability problems in the core network. This has motivated a large
research effort to develop a stateless QoS provisioning approach, i.e. the DiffServ
paradigm. The idea that per-flow admission control needs to be introduced in IP
DiffServ networks, in order to control traffic load and therefore to provide
quantifiable service enhancements, is gaining consensus in the Internet research
arena. As suggested in the recent RFC [1], in a DiffServ framework, it appears
necessary to define an “admission control function, which can determine whether to
admit a service differentiated flow along the nominated network path”. In fact, an
apparent limit of the DiffServ framework lies in the fact that this approach lacks a
standardized admission control scheme, and does not intrinsically solve the problem
of controlling congestion in the Internet. Upon overload in a given service class, all
flows in that class suffer a potentially harsh degradation of service.

Several Distributed Admission Control algorithms have recently appeared in the
literature. These proposals share the idea that each network node should accept new
flows according to end-to-end congestion measurements and decision criteria. No
coordination among network routers (i.e. explicit signaling) is necessary and the final
decision whether to admit or reject a new offered flow is pushed to the edge of the IP
network.

A critical issue is the reliability and effectiveness of end-to-end mechanisms.
Therefore a novel approach, which integrates internal measurements in each node
with an end-to-end lightweight probing phase, has been proposed in [2, 3], under the
name GRIP (Gauge&Gate Reservation with Independent Probing). In a GRIP
compliant IP domain, each network node is endowed with a measurement module,
which continuously monitors the QoS traffic queues and decides, by means
of proprietary decision criteria, when to switch from an Accept state, where new
admission requests can be accepted, to a Reject state, where no further QoS flows can be
admitted. The internal state of each router is advertised by suitably handling probe
packets. These packets are emitted by end-systems before data transfer in order to test
the resource availability within the relevant domain, that is, to get permission to start
QoS sessions. Probe packets are forwarded by nodes in the Accept state and discarded by
those in the Reject state. GRIP is DiffServ compliant since best effort, probe, and QoS
packets are distinguished and treated according to their DS Code Point
only. In addition, in [4] we have shown that the GRIP operation is semantically
compatible with the AF PHB [5].

Each GRIP node is independent of the others and each GRIP domain can adopt
proprietary strategies to provide QoS. No information needs to be exchanged
among nodes to set internal states, thus integrating flexibility with the performance
effectiveness of internal traffic measurements. The probes have the task of conveying
the status of the network to the end-nodes. The end-nodes can exploit such
information to take per-flow accept/reject decisions. This idea is close to what TCP
congestion control does, but it is used in the novel context of admission
control. GRIP is related to the family of distributed schemes recently proposed in the
literature under the denomination Endpoint Admission Control (EAC) (e.g., [6] and
references therein). However, in our opinion, the GRIP solution overcomes
some problems of early EAC schemes (see [2, 3, 4] for more details).

In addition, we stress that GRIP i) does not use a per-flow signaling protocol within
the routers, in order to avoid the pitfalls of IntServ; ii) does not
introduce packet marking schemes within routers (as in [7]) or any other substantial
modifications of the basic IP way of operation; iii) does not force routers to parse and
capture higher layer information (e.g. TCP SYN or SYN/ACK, as in [8]), which
would mean modifying the basic router operation. What GRIP does is to implicitly
convey the status of core routers to the end points, so that the latter can take
informed admission control decisions without violating the DiffServ paradigm.

The problem we address in this paper is to provide strict QoS guarantees in every
operational condition. To this purpose, information about the characteristics of QoS
traffic flows is needed in order to correctly decide the internal state. Thus, we
assume that traffic sources are regulated at network edges by standard Dual Leaky
Buckets, as in the IntServ framework. In addition, network nodes have to be aware of
the adopted DLB parameters. In the homogeneous traffic case (that is, when all flows
are regulated by DLBs with the same parameters) this information is fixed once and
for all. In the heterogeneous traffic case, different alternatives can be envisaged (see
Section 5). In any case, GRIP does not require explicit signaling of information about the
traffic mix composition, that is, how many active flows of each regulator type are
present in each node. This duty is assigned to the measurement and estimation module in
each node, which operates on traffic aggregates only. We stress that the above
assumptions are necessary to guarantee performance. In other words, removing some
key assumptions means giving up assured QoS levels. Finally, we note that
the measurement procedure proposed in this paper could also be applied in
frameworks other than GRIP.

As for the organization of the paper, Section 2 describes the GRIP
operation in more detail. Section 3 provides a description of the adopted
measurement scheme in a homogeneous traffic scenario. Section 4 is concerned with
the relevant performance evaluation. Section 5 deals with a heterogeneous traffic
scenario. Section 6 is dedicated to our conclusions.



2 GRIP: Gauge&Gate Reservation with Independent Probing 

We envision GRIP as a mechanism composed of two components: (i) GRIP end- 
nodes operation, (ii) GRIP internal router operation. For clarity of presentation, in 
what follows, we identify the source and destination user terminals as the network 
end-nodes, taking admission control decisions. However, for obvious security 
reasons, such end-nodes should be the ingress and egress nodes, under the control of 
the network operator(s) (more discussion on security issues can be found in [4]). 



2.1 GRIP End-Nodes Operation 



GRIP's end-node operation is extremely simple. Fig. 1a illustrates the setup of a
unidirectional flow (from source to destination). When a user terminal requests a
connection with a destination terminal, the Source Node starts a Probing Phase by
injecting just one Probe Packet into the network. Meanwhile, it activates a probing
phase timeout, lasting for a reasonably short time. If no response is received from the
destination node before the timeout expires, the source node enforces rejection of
the connection setup attempt. Otherwise, if a Feedback packet is received in time, the
connection is accepted, the probing phase is terminated, and control is given back to
the user application, which starts a Data Phase, simply consisting of the transmission
of information packets. The role of the Destination Node simply consists of
monitoring the incoming IP packets, intercepting the ones labeled as Probes, reading
their source address, and, for each incoming probe packet, replying with the
transmission of a feedback packet, if the destination is willing to accept the set-up
request. The only mandatory requirement is that Probes and Information packets are
labeled with different values of the DS codepoint field in the IP packet header. This
enables destination nodes to distinguish between Probes and Information packets.
Probing packets do not carry information describing the characteristics of the
associated data traffic (e.g., peak bandwidth). This information is implicitly conveyed
by means of the DSCP tag (i.e., a given kind of data traffic is associated with a given
DSCP tag).

Note that the described GRIP operation can be trivially extended to provide setup 
for bidirectional connections. 
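The source-node side of the probing phase can be sketched as follows. This is only an illustration under stated assumptions: `probing_phase`, `send_probe` and `feedback_queue` are hypothetical names, the network is abstracted as a callable plus an in-memory queue, and the timeout value is arbitrary.

```python
import queue


def probing_phase(send_probe, feedback_queue, timeout_s):
    """Run one probing phase at the source node.

    send_probe: hypothetical callable that injects a single Probe packet.
    feedback_queue: queue.Queue on which a Feedback packet would be delivered.
    timeout_s: probing-phase timeout in seconds.
    Returns True if the connection is accepted, False if it is rejected.
    """
    send_probe()  # exactly one probe per setup attempt
    try:
        feedback_queue.get(timeout=timeout_s)
    except queue.Empty:
        return False  # no feedback before the timeout: reject the setup
    return True  # feedback arrived in time: the Data Phase may start


if __name__ == "__main__":
    answered = queue.Queue()
    answered.put("feedback")
    print(probing_phase(lambda: None, answered, 0.05))
    print(probing_phase(lambda: None, queue.Queue(), 0.05))
```

The source never learns why a probe was lost; a missing Feedback packet is the only rejection signal, which is what keeps the end-node logic this small.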



Fig. 1. End point and router operation (panels a and b; panel a shows the source app, source node, destination node, and destination app)



2.2 GRIP Router Operation 

The GRIP-aware router operation is illustrated in Fig. 1b. For convenience of
presentation, we assume that the router handles only homogeneous GRIP controlled
traffic. Other traffic classes (e.g., best-effort traffic) can be handled by means of
additional queues, possibly with lower priority. At each router output port, GRIP
implements two distinct kinds of queues, one for data packets, i.e., packets belonging to
flows that have already passed an admission control test, and one for probing traffic.
Packets are dispatched to the respective buffers according to the probe/data DSCP tag.
This enables routers to provide different forwarding methods for Probes and
Information packets (e.g. granting service priority to Information packets). As a
consequence, the performance of the accepted traffic is not affected by congestion
occurring in the probing buffer.

The GRIP router measures the overall aggregate accepted traffic. On the basis of 
running traffic measurements, the router enforces a Decision Criterion, which 
continuously drives the router to switch between two states: ACCEPT and REJECT. 
When in the ACCEPT state, the Probing queue accommodates Probe packets, and 
serves them according to the described priority mechanism. Conversely, when the 
router switches to the REJECT state, it discards all the Probing packets contained in 
the Probing queue, and blocks all new Probing packets arriving. 

In other words, the router acts as a gate for the probing flow, where the gate is
opened or closed on the basis of the traffic estimates (hence the Gauge&Gate in the
acronym GRIP). The Decision Criterion may also be based on procedures other
than traffic measurements, such as signaling mechanisms used by underlying
layers (e.g. ATM), simpler approaches (e.g., limiting accepted probe packets via
probe buffer limitations), or other tunable proprietary schemes. This mechanism
provides an implicit signaling pipe to the end points of which the network remains
unaware, leaving each router in charge of deciding whether it can admit new flows or
is congested. Since the distributed admission control decision is tied to the
successful reception of the Probing packets at the destination, locally blocking
probing packets implies aborting all concurrent setup attempts of connections whose
path crosses the considered router. Conversely, a connection is successfully set up
when all the routers crossed by a probing packet are found in the ACCEPT state. As
regards performance, the level of QoS support provided
depends on the effectiveness of the Decision Criterion implementation. In
particular, in this paper we propose a simple and effective criterion based on traffic
measurements and suitable assumptions on the offered traffic.
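The gate behaviour described in this section can be illustrated with a small sketch. It assumes a hypothetical `Router` class holding only the ACCEPT/REJECT state and a helper `probe_reaches_destination`; how each router decides its state (the Decision Criterion) is deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    ACCEPT = "ACCEPT"
    REJECT = "REJECT"


@dataclass
class Router:
    """Hypothetical router model: only the gate state is represented."""
    name: str
    state: State = State.ACCEPT

    def forward_probe(self) -> bool:
        """Forward a probe only while the gate is open (ACCEPT state)."""
        return self.state is State.ACCEPT


def probe_reaches_destination(path) -> bool:
    """A probe survives only if every router on the path forwards it."""
    return all(router.forward_probe() for router in path)


if __name__ == "__main__":
    path = [Router("r1"), Router("r2"), Router("r3")]
    print(probe_reaches_destination(path))
    path[1].state = State.REJECT
    print(probe_reaches_destination(path))
```

No router needs to know the state of any other: a single closed gate anywhere on the path is enough to make the setup attempt fail at the edge.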



3 Decision Criterion for Homogeneous Traffic Scenario 

In this paper we focus on the performance evaluation of GRIP in a single DS domain. 
Specifically, in this section, we deal with homogeneous traffic sources. 



3.1 Traffic Source Model 

Traffic sources offered to the considered domain are regulated at the boundary of the
domain by standard Dual Leaky Buckets (DLBs), as in the IntServ framework. This
implies that, within the domain, each source is fully characterized by its Traffic
Descriptors, given in terms of three DLB parameters, namely Peak Rate, $P_s$,
Sustainable Rate, $r_s$, and Token Bucket Size, $B_{TS}$.

The Peak Rate (bytes/s) represents the maximum emission rate of the source. In
what follows, we neglect the Peak Tolerance, i.e. a parameter sometimes specified in
DLB characterization, which accounts for the peak rate variations within a given time
frame. The Sustainable Rate (bytes/s), in conjunction with the Token Bucket Size,
jointly characterizes the average emission rate of the source as well as its
variability. In addition, we specifically require the DLB to enforce that traffic does
not fall below the sustainable rate specification. This is accomplished by the emission
of "dummy" packets (e.g. empty packets), in order not to waste tokens. In other words,
the traffic coming out of the DLB fully uses all the opportunities allowed by the
regulating device (this does not mean that sources always emit at peak rate!).

Note that this assumption is not unrealistic: if a user requests a QoS service
and pays for it on the basis of the selected DLB parameters, it is likely that emission
opportunities will not be wasted (greedy sources). In addition, without this
assumption, as in any measurement-based scheme, it is not possible to correctly
estimate the admitted flows and thus to guarantee performance. The consequence is
that the number of bytes, $b(T)$, emitted by a source during an arbitrary time window of
size $T$ (seconds) is upper and lower bounded as follows:

$$\max(r_s T - B_{TS},\ 0) \le b(T) \le r_s T + B_{TS} \qquad (1)$$

The upper bound results from the standard DLB operation, while the lower bound
is a consequence of our assumption of greedy sources. Finally, in this Section, we
assume that the sources are homogeneous, that is, they are regulated by means of the
same DLB parameters. No additional assumptions are made on the traffic emission
pattern, i.e., $b(T)$ is a random variable with general distribution in the above range.
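As a small numerical illustration of the bounds in Eq. (1), the sketch below evaluates them for a greedy DLB-regulated source. The function name `emitted_bytes_bounds` and the parameter values are assumptions chosen for the example, not figures from the paper.

```python
def emitted_bytes_bounds(r_s, b_ts, window_s):
    """Bounds on the bytes b(T) a greedy DLB source emits in a window.

    r_s: sustainable rate (bytes/s); b_ts: token bucket size (bytes);
    window_s: window length T (seconds).
    Lower bound: max(r_s*T - B_TS, 0); upper bound: r_s*T + B_TS.
    """
    lower = max(r_s * window_s - b_ts, 0)
    upper = r_s * window_s + b_ts
    return lower, upper


if __name__ == "__main__":
    # Hypothetical source: 10 kB/s sustainable rate, 50 kB bucket.
    for window in (2, 20):
        print(window, emitted_bytes_bounds(10_000, 50_000, window))
```

For windows shorter than $B_{TS}/r_s$ the lower bound collapses to zero, which is why a short measurement window carries little information about the number of active flows.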



3.2 Decision Criterion 

The localized Decision Criterion running on each router's output link is based on the
runtime estimation of the number of active sources. Our proposed Decision
Criterion is founded on a strong theoretical result, proven in [9, 10], which states that,
in the presence of DLB regulated sources, target performance (e.g., loss/delay) levels
are hard-guaranteed whenever the number of admitted sources does not
exceed a suitable threshold $K$. For our purposes, we consider $K$ as a "tuning knob",
which allows the domain operator to set target performance levels [11]. The detailed
relationship between $K$ and the guaranteed performance levels results from an off-line
computation, following the approach presented in the above references. In other
words, we assume that the operator chooses target performance levels; the latter are
mapped to a value of $K$ by means of the DLB characterization and of a specific
algorithm (e.g., the one of [10]), and GRIP enforces this value. GRIP is completely
independent of the specific algorithm chosen to evaluate $K$ (and thus any other
algorithm could be selected, including evaluating $K$ by means of trials in a specific
network scenario).

We now provide a measurement technique aimed at estimating the number of
admitted traffic sources on the basis of the traffic measured over the considered link
during a fixed size sliding window of size $T$ (seconds). To define the length of such a
window, we use the period of the worst case DLB output [10], characterized by an
activity (On) period with emission at the Peak rate and a silent (Off) period, both with
deterministic length. The length of this period is $T_{min} = (B_{TS} P_s)/(r_s (P_s - r_s))$.
The window size is $T$, with $T \ge T_{min}$, in order to capture at least the minimum
periodicity of the source, associated with the worst case DLB output.
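The period of the worst case DLB output, read here as $B_{TS} P_s / (r_s (P_s - r_s))$, can be checked numerically. The sketch below assumes the usual token-bucket reading of the worst case (an On period of length $B_{TS}/(P_s - r_s)$ at peak rate followed by an Off period of length $B_{TS}/r_s$); that decomposition, the function name and the parameter values are illustrative assumptions.

```python
import math


def worst_case_period(p_s, r_s, b_ts):
    """Period of the assumed worst-case On/Off DLB output, in seconds.

    p_s: peak rate (bytes/s); r_s: sustainable rate (bytes/s);
    b_ts: token bucket size (bytes).
    """
    if not 0 < r_s < p_s:
        raise ValueError("expected 0 < r_s < p_s")
    t_on = b_ts / (p_s - r_s)   # bucket drains while sending at peak rate
    t_off = b_ts / r_s          # bucket refills while the source is silent
    return t_on + t_off


if __name__ == "__main__":
    # Hypothetical source: 40 kB/s peak, 10 kB/s sustainable, 50 kB bucket.
    period = worst_case_period(40_000, 10_000, 50_000)
    closed_form = 50_000 * 40_000 / (10_000 * (40_000 - 10_000))
    print(period, math.isclose(period, closed_form))
```

The sum of the two phases reduces algebraically to the closed form, so the sketch is only a sanity check on the reading of the formula.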

To go further, let us first note that, in a static scenario (i.e. no arriving and
departing flows), with $N$ the fixed number of offered flows during the time window
$T$, Eq. (1) yields the following inequality on the number of bytes, $A(T)$, measured in
the time window $T$:

$$A_{min} = N (r_s T - B_{TS}) \le A(T) \le A_{max} = N (r_s T + B_{TS}) \qquad (2)$$

Alternatively, we can read Eq. (2) as follows: given a window $T$ and a number $A(T)$
of bytes measured within the window, the number of offered flows is a random
variable in the range:

$$N_{MIN} = \lceil A(T)/(r_s T + B_{TS}) \rceil \le N \le N_{MAX} = \lfloor A(T)/(r_s T - B_{TS}) \rfloor \qquad (3)$$

If no conjecture is made on the statistical properties of the emission process for
each source (we do not rely on how the traffic source fills the DLB), the distribution
of $N$ between these two extremes remains unknown. To provide a conservative
estimate, the admission control scheme estimates the number of allocated flows as
$N_{est} = N_{MAX}$. A new flow is accepted if $N_{est} \le K - 1$ (i.e. if there is space for at
least one more flow). The router is in the ACCEPT state as long as

$$\lfloor A(T)/(r_s T - B_{TS}) \rfloor \le K - 1 \qquad (4)$$

Finally, we note that in this static scenario (i.e. no arriving and departing flows)
Eq. (3) shows that, as the window size increases, the bounds $N_{MIN}$ and $N_{MAX}$ become
closer, until they become equal to each other, thus allowing the number of offered
flows to be evaluated exactly. This happens for $T > (2N - 1) B_{TS} / r_s$. Such a window size
(that we can denote as exact-window) can last a significant amount of time. We stress
that our measurement procedure operates in the background and does not influence the
flow set-up time, as happens in some other schemes. However, even if we are not
constrained to adopt a very short measurement time, we want to avoid always using an
exact-window, since its size can reach values in the order of several minutes, depending on
the DLB parameters. This is because too large a window has some drawbacks in real
non-static scenarios, which will be discussed in the sequel. The solution is to trade off the
window size against the accuracy in the estimation of $N$. In any case, the choice of $T$
influences the network efficiency but not the performance perceived by users, which
is always guaranteed.
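The estimate of Eq. (3) and the ACCEPT condition of Eq. (4) can be sketched as below. The function names and the numeric values are hypothetical; the sketch assumes the measured byte count over the sliding window is already available and that the window is longer than $B_{TS}/r_s$, so that the denominator of the conservative estimate is positive.

```python
import math


def flow_count_bounds(measured_bytes, r_s, b_ts, window_s):
    """Return (N_MIN, N_MAX) for the number of flows behind measured_bytes.

    N_MIN = ceil(A / (r_s*T + B_TS)), N_MAX = floor(A / (r_s*T - B_TS)).
    """
    if r_s * window_s <= b_ts:
        raise ValueError("window must be longer than b_ts / r_s")
    n_min = math.ceil(measured_bytes / (r_s * window_s + b_ts))
    n_max = math.floor(measured_bytes / (r_s * window_s - b_ts))
    return n_min, n_max


def router_accepts(measured_bytes, r_s, b_ts, window_s, k):
    """ACCEPT while the conservative estimate N_MAX leaves room for one flow."""
    _, n_max = flow_count_bounds(measured_bytes, r_s, b_ts, window_s)
    return n_max <= k - 1


if __name__ == "__main__":
    # Hypothetical link: 10 flows at 10 kB/s average over a 20 s window.
    print(flow_count_bounds(2_000_000, 10_000, 50_000, 20))
    print(router_accepts(2_000_000, 10_000, 50_000, 20, k=14))
    print(router_accepts(2_000_000, 10_000, 50_000, 20, k=13))
```

With these made-up numbers ten actual flows are estimated as up to thirteen, which shows the price of the conservative choice $N_{est} = N_{MAX}$ when the window is short relative to $B_{TS}/r_s$.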



3.3 Stack Protection Mechanism 

A limit of the above analysis is that it does not account for transient phenomena.
Consider, in fact, a router in the ACCEPT state, i.e. for which condition (4) is
verified. If a large number of new flows are offered in a short time frame, the
measurement mechanism is not sufficient to protect the system from over-allocation,
i.e. all the newly incoming flows may be accepted. In fact, when a router "accepts"
the Probe packet of a given flow (i.e., the probe finds the gate open), the relevant
Data packets have not yet been emitted by the source. In other words, there exists a
transient time during which the router is loaded with a new flow, but $A(T)$ in Eq. (4) only
partially accounts for the new traffic contribution. This transient time is upper
bounded by $T$, if we assume that the round trip delay is smaller than $T$ (which is a safe
assumption, since the measurement window typically lasts several seconds, see
below).

To overcome the above described problem, and to provide strict guarantees in any
operational condition, we introduce a protection mechanism based on a "stack"
variable, which keeps memory of the amount of "transient" flows. Whenever a probe
packet is accepted, the stack is incremented by one. A timer equal to the duration of
the measurement window $T$ is then started, and the stack is linearly decremented at a
rate $1/T$ until the timer expires. Thus, at time $t$, if a single flow has been admitted at
time $t_i$, neglecting the Round Trip Delay, the stack is equal to $1 - (t - t_i)/T$ and thus
it compensates for the lack of packets emitted by the source in the time interval
$[t - T, t_i]$. To account for the stack variable, it is simply necessary to modify the decision
criterion defined in condition (4) as follows, i.e. the router remains in the ACCEPT
state as long as:

$$N_{est+STACK} = \lfloor A(T)/(r_s T - B_{TS}) + STACK \rfloor \le K - 1 \qquad (5)$$



where the STACK variable accounts for the sum of the contributions of each flow being
set up. As a side note, we remark that the stack is a simple aggregate variable and
hence it does not require processing probing packets or extracting and maintaining state
information. While the stack mechanism is mandatory to protect the system from QoS
impairments caused by burst arrivals of newly offered flows, in steady state
conditions its effect is to reduce the system throughput, for two reasons.

1. In steady state conditions, a new flow replaces, on average, a departing flow. This
means that the last bytes emitted by the departing flows compensate, on average, for
the lack of bytes emitted by the newly incoming flow during the measurement time
T. However, since these bytes are also accounted for in the stack variable, condition (5)
turns out to be overly conservative. The performance evaluation of this phenomenon is
carried out in the next Section.

2. The stack variable is incremented for each new incoming probing packet,
regardless of whether the locally accepted probing packet will result in a new
accepted connection or will be blocked by a subsequent router along the path (in
which case, the stack will provide transient reservation of system resources for a
flow that never arrives). Results that show the effect of probing packet losses in
subsequent routers along the path are given below, and show that the throughput
impairment is marginal.
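One possible bookkeeping for the linearly decaying stack, and its use in condition (5), is sketched below. The class `LinearStack`, the function `accepts_with_stack` and all numeric values are hypothetical; in particular, storing one timestamp per accepted probe is just a convenient way to model the per-probe timers mentioned in the text, not necessarily how a router would implement the aggregate variable.

```python
import math


class LinearStack:
    """Sum of per-probe contributions, each decaying from 1 to 0 over T."""

    def __init__(self, window_s):
        self.window_s = window_s
        self._accept_times = []

    def on_probe_accepted(self, now):
        """Record an accepted probe; its contribution starts at 1."""
        self._accept_times.append(now)

    def value(self, now):
        """Current stack value; contributions older than T are discarded."""
        self._accept_times = [t for t in self._accept_times
                              if now - t < self.window_s]
        return sum(1.0 - (now - t) / self.window_s
                   for t in self._accept_times)


def accepts_with_stack(measured_bytes, r_s, b_ts, window_s, k, stack_value):
    """ACCEPT while floor(A/(r_s*T - B_TS) + STACK) <= K - 1."""
    estimate = measured_bytes / (r_s * window_s - b_ts) + stack_value
    return math.floor(estimate) <= k - 1


if __name__ == "__main__":
    stack = LinearStack(window_s=20)
    stack.on_probe_accepted(now=0)
    for t in (0, 5, 10, 20):
        print(t, stack.value(t))
```

Each accepted probe thus reserves room for one flow at first and releases it gradually as the flow's own traffic enters the measurement window.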

To discuss the effect of the stack when not all probing packets lead to a flow
setup, it is in principle necessary to simulate a multi-node topology. To simplify the
analysis, we have considered a simulation model where only a single node is
considered but where, with a given probability, a probe packet is assumed to be blocked in
later stages of the network. This simulated scenario allows us to evaluate the
performance drawbacks induced in GRIP by effect 2) discussed above. In fact, in
GRIP, each node independently decides whether to accept or reject a new flow, but
there is no explicit means to determine whether a locally accepted source has been
blocked by later stages of the network.

The linear stack mechanism should apparently provide overly conservative results,
as each accepted flow is accounted for, during the measurement time T, as an incoming
one. However, our numerical results show that the performance degradation induced
by the stack implementation is almost negligible. Fig. 2 shows the number of
admitted flows in a generic router's output link versus the simulation time, for
different values of the probing packet blocking probability in later stages of the
network (offered load equal to 240 Erl). We set a target number of admissible flows,
K, equal to 100. Ideally, the number of admitted flows should be close to K, and
respect the strict constraint of not exceeding this value. The curve relevant to a
probe loss equal to 0% in Fig. 2, after a transient period, stabilizes around a value of
about 80, instead of 100. Such loss of efficiency is due to effect 1) discussed above
(that is, to the transient effect of the stack mechanism) and to the conservative
evaluation of the number of active flows, evaluated according to Eq. (4). The curves
relevant to a probe loss equal to 25% and 50% also account for effect 2) discussed
above. However, the throughput performance is very close to that obtained in the
optimal case of no probing packet blocking. The most notable effect is a faster
transient in the case of no probing packet blocking.

Fig. 2. Number of admitted flows (CBR) versus the simulation time: T=20 seconds, blocking
probability of probing packets in subsequent routers equal to 0, 25%, 50% and 75%

In the case of 75% probing packets blocking probability, the number of allocated 
flows reduces simply because the "valid" offered load gets lower than the considered 
node capacity, which is equal to 100 Erl (while in this case we are offering 60 Erl, i.e. 
25% of 240 Erl). These results hint that the effect of probe loss on the stack 
mechanism is very limited. As a consequence, in the following, we will focus on the 
evaluation of the other two effects mentioned above (i.e., transient effect of the stack 
mechanism and conservative evaluation of the number of active flows). 



4 Performance Evaluation of the Homogeneous Traffic Scenario 

A detailed performance evaluation of the effect of the stack variable requires 
considering a dynamic scenario, accounting for flow departures and arrivals. In this 
Section, we derive the utilization coefficient of a generic router's output link and an 
upper bound and a lower bound of the same quantity. 



4.1 Evaluation of the Utilization Coefficient 

Assume that the duration of offered flows is exponentially distributed, with mean
value $1/\mu$. To analyze high load conditions (which are the most critical ones), we
assume an impulsive load model (see e.g. [12]), in which a new flow is admitted to
the system whenever condition (5) is verified, i.e. whenever the router switches from
the REJECT to the ACCEPT state. In other words, users continuously submit new call
requests as soon as the system leaves the full occupancy status. Due to the DLB
regulation, the average emission rate of each traffic source is assumed equal to the
DLB sustainable rate, $r_s$.

Now, assume that the window size $T$ is small with respect to the mean flow
duration $1/\mu$. In these conditions, the probability that a new flow activates and
deactivates within the time $T$ is negligible. This approximation has the effect of
overestimating the utilization coefficient, since the STACK mechanism introduces
inefficiency. Obviously, this effect increases when the difference $(1/\mu - T)$ decreases. In
any case, this effect can be accounted for, even if, for simplicity, in this paper we
neglect it (given its relatively small contribution to overall performance, see the
numerical results).

Let us now define the quantity $T_{react}$ as the measurement scheme “reaction time”,
i.e., the time elapsing between the instant a flow departs from the system and
the instant a REJECT to ACCEPT state transition occurs in the router (and
thus, thanks to the impulsive load assumption, a new flow is admitted). Further, we
define $T_{react,ave}$ as the average value of $T_{react}$. In other words, the system cannot
realize immediately that a flow has switched off because, since the traffic sources are VBR, a
temporary decrease of the measured bit rate could be due to source activity variations.
Thus the measurement procedure needs some time to distinguish between real flow
de-activations and statistical fluctuations of active source bit rates. Considering that,
owing to the impulsive load assumption, each departing flow will be replaced on
average after a time $T_{react,ave}$ by a new incoming one, the average number of active
flows during the time window $T$ is given by:

$$N = N_p + N_d \frac{T - T_{react,ave}}{T} \qquad (6)$$

where $N_p$ is the number of flows that remain active during the whole measurement
window $T$, and $N_d$ is the number of flows departed during $T$ and replaced by
a newly incoming flow. In turn, the average number of departing flows during the
time interval $T$ is given by

$$N_d = N \mu T \qquad (7)$$

which combined with (6) yields:

$$N_p = N \left(1 - \mu (T - T_{react,ave})\right) \qquad (8)$$



We are now able to write condition (5), replacing $A(T)$ with the sum of the average
contributions of the $N_p$ flows that remain active during $T$, and of the departing/arriving
flows:

$$\frac{N_p r_s T + N_d r_s (T - T_{react,ave})}{r_s T - B_{TS}} + STACK = K \qquad (9)$$

Noting that flow arrivals are uniformly distributed in the time window $T$, the
average stack value is $STACK = N_d/2$. Substituting this value in (9), and owing to (7) and
(8), we can finally write a single equation which yields the average system throughput
as:

$$\rho = \frac{N r_s}{C} = \frac{K (1 - T_{OFF}/T)}{1 + \frac{\mu}{2}(T - T_{OFF})} \cdot \frac{r_s}{C} \qquad (10)$$

where $T_{OFF} = B_{TS}/r_s$ is the maximum silence period allowed by the DLB. Note
that to obtain this result, we do not have to explicitly provide a value for $T_{react,ave}$
(which, in any case, is trivially shown to be given by $T_{react,ave} = (T - T_{OFF})/2$),
since in the evaluation of the utilization coefficient, we have both a positive and a
negative contribution of this quantity, which balance each other.

From Eq. (10), it is straightforward to calculate the optimal value of the
measurement window $T$ that maximizes $\rho$:

$$T_{opt} = T_{OFF} + \sqrt{2 T_{OFF}/\mu} \qquad (11)$$



This is an important result, since it allows dimensioning the measurement window.
Equations (10) and (11) also suggest the following considerations. If the mean flow
duration increases, $T_{opt}$ will increase too (for $\mu \to 0$, $T_{opt} \to \infty$). As regards the
utilization coefficient, when $\mu \to 0$, the STACK contribution decreases and $\rho$
increases with the measurement window size $T$:

$$\rho \xrightarrow{\ \mu \to 0\ } \frac{K (1 - T_{OFF}/T)\, r_s}{C} \xrightarrow{\ T \to \infty\ } \frac{K r_s}{C} \qquad (12)$$



In other words, as discussed at the end of Section 3.2, in these conditions
($\mu \to 0$, $T \to \infty$) the number of flows is evaluated exactly and the utilization
coefficient is equal to the ideal one (i.e., N=K).

To give a physical meaning to the above equations, we note that, with respect to
the static formula (i.e., when $\mu \to 0$), the effect of the STACK protection is taken
into account in (10) by means of the term $(T - T_{OFF}) \mu / 2$ in the denominator; this
term is equal to the mean reaction time multiplied by the mean rate of
departures/arrivals ($\mu$).
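
As a numerical illustration of Eqs. (10) and (11), the sketch below evaluates the average utilization and the optimal window for the case-study parameters of Section 4.4. The function and constant names are illustrative, and the unit conversion (1 Kbyte = 1024 bytes) is an assumption, chosen because it reproduces the reference utilization of 0.68 quoted for K=100.

```python
import math


def utilization(t, k, r_s, c, t_off, mu):
    """Average utilization rho(T) as in Eq. (10); r_s and c share one unit."""
    if t <= t_off:
        raise ValueError("the window T must be larger than T_OFF")
    n_avg = k * (1.0 - t_off / t) / (1.0 + 0.5 * mu * (t - t_off))
    return n_avg * r_s / c


def optimal_window(t_off, mu):
    """Window that maximizes Eq. (10): T_OFF + sqrt(2 * T_OFF / mu)."""
    return t_off + math.sqrt(2.0 * t_off / mu)


# Case-study values of Section 4.4; 1 Kbyte = 1024 bytes is an assumption.
R_S = 1.7 * 1024 * 8          # sustainable rate r_S, bit/s
C = 2.048e6                   # link rate, bit/s
T_OFF = 5300 * 8 / R_S        # B_TS / r_S, seconds
MU = 1.0 / 240.0              # per-flow departure rate, 1/s
T_OPT = optimal_window(T_OFF, MU)
RHO_OPT = utilization(T_OPT, 100, R_S, C, T_OFF, MU)
```

Evaluating `utilization` on a grid of window sizes around `T_OPT` is a quick way to see how flat the maximum is before fixing T in a router configuration.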



4.2 Evaluation of an Upper Bound of the Utilization Coefficient 

In order to evaluate an upper bound of the utilization coefficient, denoted as
$\rho_{upp}$, we make the following assumptions:

• each flow emits according to its minimum emission profile, i.e. it emits
$r_S (T - T_{OFF})$ bits during the measurement window;

• we consider the minimum value for the reaction time;

• the per-flow STACK contribution is equal to zero (instead of 1/2).

Consequently, following the same approach used in the preceding section, we
have:

$$N_{upp} = K, \qquad \rho_{upp} = \frac{K r_S}{C} \qquad (13)$$

This is a rather obvious result, reported only to show that specializing our
derivation to this case gives the expected result. In other words, the upper bound of
the utilization coefficient is equal to the ideal one, when K flows are admitted in the
system.



4.3 Evaluation of a Lower Bound of the Utilization Coefficient

We first derive the “maximum reaction time” of the measurement scheme, i.e., the
maximum possible value of $T_{react}$. It is easy to deduce that the value of
$T_{react,max}$ is the solution of the following equation:

$$\frac{N r_S T}{r_S T - B_{TS}} - \frac{(N-1) \, r_S T + r_S \left( T - T_{react,max} \right)}{r_S T - B_{TS}} = 1 \;\Longrightarrow\; T_{react,max} = T - \frac{B_{TS}}{r_S} = T - T_{OFF} \qquad (14)$$

This equation is obtained by imposing that the difference between the left-hand side of
(5), computed assuming that each flow emits at its average rate and no flows depart,
and the left-hand side of (5), computed assuming that the departure of one flow at
time $(T - T_{react,max})$ is detected after $T_{react,max}$ seconds, is equal to one.

In order to evaluate a lower bound of $\rho$, denoted as $\rho_{low}$, we make the following
assumptions:

• each flow emits according to its maximum emission profile, i.e. it emits
$r_S (T + T_{OFF})$ bits during the measurement window;

• we consider the maximum value for the reaction time, i.e. $T_{react,max} = T - T_{OFF}$;

• the per-flow STACK contribution is equal to one (instead of 1/2).

Following the same approach of Section 4.1, we obtain:

$$N_{low} = \frac{K \left( T - T_{OFF} \right)}{\mu T^2 + T + T_{OFF} \left( 1 - \mu T_{OFF} \right)}, \qquad \rho_{low} = \frac{N_{low} \, r_S}{C} \qquad (15)$$
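
The three estimates of Sections 4.1 to 4.3 can be compared side by side. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and the formulas are the closed forms of Eqs. (10), (13) and (15) as read here.

```python
def n_upper(k):
    """Upper bound on the number of admitted flows, Eq. (13)."""
    return float(k)


def n_average(t, k, t_off, mu):
    """Average number of admitted flows, Eq. (10)."""
    return k * (1.0 - t_off / t) / (1.0 + 0.5 * mu * (t - t_off))


def n_lower(t, k, t_off, mu):
    """Lower bound on the number of admitted flows, Eq. (15)."""
    return k * (t - t_off) / (mu * t * t + t + t_off * (1.0 - mu * t_off))


def utilization_bounds(t, k, r_s, c, t_off, mu):
    """Return (rho_low, rho_avg, rho_upp) for a window t larger than t_off."""
    if t <= t_off:
        raise ValueError("the window T must be larger than T_OFF")
    scale = r_s / c  # r_s and c must be expressed in the same unit
    return (n_lower(t, k, t_off, mu) * scale,
            n_average(t, k, t_off, mu) * scale,
            n_upper(k) * scale)
```

For any window larger than $T_{OFF}$ the three values are ordered as lower bound, average, upper bound, which is the ordering the curves of the numerical results are expected to show.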



4.4 Numerical Results 

In this Section, we present a simulation analysis to evaluate the effectiveness of the 
proposed scheme and of the analytical bounds. As discussed above, we make no 
assumptions on the traffic source behavior, beyond the worst case parameters 
supplied by the DLB characterization. However, in order to generate traffic, we have 
to load the DLBs with specific sources. The choice of parameters and performance 
figures has, here, only case study significance. 

We have considered three kinds of sources loading the DLBs: constant rate sources
(labeled as CBR) emitting at rate 1.7 Kbytes/s, on-off exponential voice sources
(labeled as EXPO) and MPEG sources. As regards the EXPO sources, during the On
state (talkspurt) the source emits packets periodically. The On and Off state durations
are exponentially distributed, with average values of 352 ms and 650 ms respectively.
The bit rate during the On period is equal to 4 Kbytes/s. Both source types are regulated by
DLBs with parameters: $P_S$ = 4 Kbytes/s; $r_S$ = 1.7 Kbytes/s; $B_{TS}$ = 5300 bytes. The
MPEG sources will be described in the sequel. We consider a generic router’s output
link with parameters: link rate, C = 2.048 Mbps; buffer size, B = 53000 bytes. We set,
as target performance figure, a packet loss probability, equal to 10 ^ According to 
the acceptance rule provided in [10], the corresponding maximum number of
acceptable flows is K=100. The call arrival rate, modeled as a Poisson arrival process,
has been set to 1 call/s, and call duration has been drawn from an exponential
distribution with mean value 4 minutes. This implies that 240 Erlangs are offered to
the link, i.e., more than twice the maximum number of calls that can be, in principle,
simultaneously admitted (K=100).
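
The offered load and the ideal utilization follow from simple arithmetic. The sketch below spells it out; the constant names are illustrative and the conversion 1 Kbyte = 1024 bytes is an assumption.

```python
ARRIVAL_RATE = 1.0            # calls per second (Poisson arrivals)
MEAN_DURATION = 4 * 60.0      # mean call duration, seconds
K = 100                       # maximum number of acceptable flows

# Offered load in Erlangs = arrival rate times mean holding time
offered_load = ARRIVAL_RATE * MEAN_DURATION
overload_factor = offered_load / K

# Ideal utilization with exactly K flows admitted, each at the mean rate r_S.
# Assumption: 1 Kbyte = 1024 bytes.
R_S_BITS = 1.7 * 1024 * 8     # bit/s
LINK_RATE = 2.048e6           # bit/s
reference_utilization = K * R_S_BITS / LINK_RATE
```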

In Fig. 3a and Fig. 3b we show the utilization coefficient, denoted as $\rho$, as a
function of the measurement window size T, for the EXPO and the CBR case,
respectively. In the two figures, several curves are reported: i) the utilization
coefficient in the ideal condition of K=100 (labeled "Reference System"); this is a
horizontal line since it is independent of the measurement process (equal in this case
to 0.68 and equal to the upper bound evaluated in Section 4.2); ii) the simulation
results; iii) the value of $\rho$ evaluated in Section 4.1; iv) the lower bound of $\rho$ evaluated
in Section 4.3.

The utilization coefficient is lower than the ideal one, due to the stack mechanism
and to the conservative estimate of the number of active flows, evaluated according
to Eq. (5). However, the maximum number of admitted sources is guaranteed to
remain below the value K=100, representing the maximum number of flows that can
be admitted without violating QoS requirements. This is an important advantage of
our (conservative) admission control criterion, and confirms that: (i) hard guarantees
can be obtained, thanks to the exploitation of the knowledge of the traffic scenario
(i.e., of the DLB parameters); (ii) our scheme appears to provide an explicit
performance calibration parameter, i.e., the value K.

We recall that the decision criterion is not aware of the effective number of the 
admitted sources, but uses a run time estimate. This has the advantage of avoiding 
explicit signaling or using other non DiffServ compliant schemes. In other words, by 
using signaling we could allocate exactly K flows. Since we want to adopt scalable 
procedures, while guaranteeing performance, we are forced to under-estimate K. The 
prices to pay are: (i) a system under utilization with respect to the “ideal” value of K 
(about 15% less), and (ii) the “a priori” knowledge of the traffic regulator parameters 
adopted at the edge of the network. 

In fact, the ultimate target of the GRIP operation is to achieve a link utilization as 
close as possible to the maximum (i.e., K sources), but without ever exceeding this 
value. In order to satisfy the latter requirement, we must accept a throughput 
penalization. We stress also that the performance of GRIP must be evaluated in its 
ability to enforce a specific value of K. The overall throughput efficiency is a direct 
consequence of the selected value of the "tunable knob" K, and can be thus arbitrarily 
adjusted by the network operator. Finally, we verified that, without the stack
protection, the number of admitted flows can exceed K.

Another consideration regarding Fig. 3a and Fig. 3b is that, due to the presence of the
DLB regulators, the performance of the two different traffic types is very similar.
For both traffic types the simulation results are close to the analytical estimation of $\rho$,
thus confirming that Eq. 11 can be used to dimension the optimal value of the
measurement window size. The value of such optimal window depends on two
parameters only: the maximum silence period allowed by the DLB and the call
departure rate $\mu$. As shown in Fig. 3c, the utilization coefficient and the accuracy of
the estimation of $\rho$ (Eq. 10) increase with the mean flow duration (since T is smaller
with respect to $1/\mu$, see Section 4.1). This figure reports results for EXPO sources,
with the same parameters used for Fig. 3a, but with a greater value of $1/\mu$, equal to 720
seconds.

[Figure 3, four panels plotted against the Measurement Window Size (T): a) Expo case, $1/\mu$ = 240 sec; b) CBR case, $1/\mu$ = 240 sec; c) Expo case, $1/\mu$ = 720 sec; d) MPEG case]

Fig. 3. Utilization Coefficient versus the Measurement Window Size.



To verify the robustness of our approach with respect to different traffic classes,
we loaded the DLBs with MPEG sources. The link capacity is now set to C=100
Mbps and the link buffer size is B=96000 bytes. The values of the DLB parameters
are: $P_S$ = 579.6 Kbytes/s; $r_S$ = 54.28025 Kbytes/s; $B_{TS}$ = 12480 bytes. We set, as target
performance figure, a packet loss probability, equal to 10'^ According to the 
acceptance rule provided in [9, 10], the corresponding maximum number of
acceptable flows is K=100. The call arrival rate, modeled as a Poisson arrival process,
has been set to 1 call/s, and call duration has been drawn from an exponential
distribution with mean value 4 minutes (the offered load is 240 Erlangs). The results are
presented in Fig. 3d. As in the previous cases, the simulation results and the
theoretical estimation of $\rho$ are close to each other. The throughput penalization due to
our approach with respect to the ideal value of K is smaller than in the previous cases.



Finally, in Fig. 4 we report $\rho$ as a function of the measurement window for different
values of the mean flow duration $1/\mu$ (100, 200, 400, 800 seconds); also shown is the
limit curve obtained when $1/\mu$ tends to infinity, labeled “static case” (see Eq. 12). As
mentioned in Section 4.1, if the mean flow duration increases, the utilization
coefficient increases too and its behavior depends only on the measurement window
size T. In the static case the utilization coefficient tends to the ideal one (Reference
system) as T increases (see again Eq. 12).

Fig. 4. Utilization Coefficient versus the Measurement Window Size T for different values of
the mean flow duration $1/\mu$ (100, 200, 400, 800 seconds)



5 Heterogeneous Traffic Scenario 

To extend GRIP to a heterogeneous traffic scenario (i.e. to a mix of traffic sources 
regulated with different DLB parameters) we need some introductory considerations. 
Let us assume that the sources are divided in I traffic classes, each comprising 
independent and homogeneous sources (i.e., with the same DLB parameters). To go 
further, we note that, in the homogeneous case, the evaluation of the number of 
admissible flows K is equivalent to the evaluation of the so-called equivalent 
bandwidth of each flow, denoted as e. The equivalent bandwidth concept is well 
known in the literature and represents the amount of bandwidth that must be assigned 
to each flow, in a statistical multiplexing framework, so as to reach some performance 
levels. The equivalent bandwidth typically lies between the mean and the
peak bit rate of the traffic source. In our case, e = C/K, where C is the link capacity
(and where K is evaluated as in [9, 10], once the buffer size B and the performance
parameters of interest, e.g. P,^_ =10'^ have been fixed). The ideal admission control 
function “accept new setup requests as long as the number of admitted flows N is less
than K” can then be rewritten as “accept new setup requests as long as the sum of the
equivalent bandwidths of accepted flows (N*e) is less than C”. In GRIP, due to the
lack of signaling, we do not know N, so we estimate it; then we make use of the
equivalent bandwidth concept by comparing the estimated N to K. Finally, we add the
stack mechanism. In other words, GRIP exploits the DLB characterization twice:
first to estimate the number of admitted flows, and then to decide
whether a new setup request can be accepted. Note that, to this end, each router must be
implicitly aware of the DLB parameters.
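
The equivalence between the two formulations of the ideal rule can be checked directly. The following is a minimal sketch with illustrative function names; it covers only the ideal rule in which N is known exactly, not the GRIP estimate or the stack mechanism.

```python
def equivalent_bandwidth(c, k):
    """Per-flow equivalent bandwidth e = C / K of a homogeneous system."""
    return c / k


def admit_by_count(n, k):
    """Ideal rule: accept while the number of admitted flows is less than K."""
    return n < k


def admit_by_bandwidth(n, e, c):
    """Same rule in bandwidth terms: accept while N * e is less than C."""
    return n * e < c


# Case-study link of Section 4.4: C = 2.048 Mbps, K = 100
E_FLOW = equivalent_bandwidth(2.048e6, 100)
```

With these values every flow count from 0 upward gives the same decision under both functions, and the first rejected request is the one arriving when N equals K.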

The equivalent bandwidth concept can in principle be easily extended to the
heterogeneous case. In [9, 10] the Authors propose an efficient algorithm to evaluate
this quantity, which under the assumption of DLB regulated sources is additive even in
the heterogeneous case. They evaluate the equivalent bandwidth of the i-th traffic
class $C_i$ ($1 \le i \le I$) as $e_i = C/K_i$, where $K_i$ is evaluated for each class in isolation, i.e.,
in a homogeneous system. In other words, the limit value $K_i$ is the number of flows
such that a link with capacity C and buffer B, loaded only with traffic belonging to the
i-th class, offers pre-defined performance levels. With this approach (which is shown
to be conservative, even if it leads to a loss of efficiency), the ideal admission rule
remains a simple sum: new setup requests are accepted as long as

$$\sum_{i=1}^{I} N_i e_i \le C \qquad (16)$$

where $N_i$ is the number of admitted flows belonging to class i.
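
A small sketch of the additive rule may help fix ideas. The class limits used in the example (K = 100 and K = 40) are hypothetical values chosen for illustration, and the check is applied to the candidate mix that already includes the new flow, which is one possible reading of rule (16).

```python
def class_equivalent_bandwidths(c, k_per_class):
    """e_i = C / K_i, with each K_i evaluated for class i in isolation."""
    return [c / k for k in k_per_class]


def mix_is_admissible(n_per_class, e_per_class, c):
    """Additive rule: the sum of N_i * e_i must not exceed the capacity C."""
    used = sum(n * e for n, e in zip(n_per_class, e_per_class))
    return used <= c


# Hypothetical two-class example on a 2.048 Mbps link
C_LINK = 2.048e6
E = class_equivalent_bandwidths(C_LINK, [100, 40])
```

Here `mix_is_admissible([50, 20], E, C_LINK)` fills the link exactly, so one more flow of either class would be refused.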

To extend GRIP to the heterogeneous case, we have to estimate the number of
admitted flows of each class and then apply the above admission rule. To this end, we
can identify three architectural alternatives. The first (trivial) alternative is shown
in Fig. 5a. Here we assume that each class is handled separately and recognized by
routers by assigning different pairs of DS codepoints to different traffic classes (i.e.,
each class has a DSCP for Probing packets and another one for Information packets).
Thus, we need 2*I different DSCPs and 2*I different logical queues. Packets
belonging to class i are classified on the basis of their DSCP tag and dispatched to the
relevant queue (Probing or Information). However, we do not need to signal
explicit information about the traffic mix composition, that is, how many active flows
of each class are present in each node.

The architecture is complex but the extension of GRIP is straightforward, since the
heterogeneous case is reduced to a combination of homogeneous ones. The admission
rule becomes:

$$\frac{A_i(T)}{r_{S,i} T - B_{TS,i}} + STACK_i < K_i \quad \text{with} \quad \sum_{i=1}^{I} K_i e_i \le C \qquad (17)$$

where $STACK_i = \left( 1 - (t - t_i)/T \right)$. In (17), $A_i(T)$ is the number of bytes emitted
during the window T by traffic sources belonging to the i-th class (i.e., at the i-th
Information queue) and $K_i$ is evaluated off-line as discussed above.
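
The per-class decision can be sketched as follows. This is an illustration under explicit assumptions, not the authors' implementation: the stack is modeled as a sum of linearly decaying terms, one per flow admitted within the last window (which matches the average per-flow contribution of 1/2 used in Section 4.1), and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClassState:
    r_s: float          # DLB sustainable rate of the class, bytes/s
    b_ts: float         # DLB token bucket size of the class, bytes
    k: int              # per-class flow limit K_i
    admit_times: list   # admission instants of recently admitted flows, s


def stack_value(admit_times, now, window):
    """Assumed stack: each flow admitted at t_j adds 1 - (now - t_j) / T."""
    return sum(1.0 - (now - t_j) / window
               for t_j in admit_times if 0.0 <= now - t_j < window)


def admit_probe(state, bytes_in_window, now, window):
    """Per-class rule: A_i(T) / (r_S,i * T - B_TS,i) + STACK_i < K_i."""
    min_bytes_per_flow = state.r_s * window - state.b_ts
    if min_bytes_per_flow <= 0:
        raise ValueError("the window T must be larger than B_TS / r_S")
    estimated_flows = bytes_in_window / min_bytes_per_flow
    return estimated_flows + stack_value(state.admit_times, now, window) < state.k
```

Because the estimate divides by the minimum per-flow emission, it never underestimates the number of active flows, and the stack term covers flows admitted too recently to be visible in the byte count.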

Note that this alternative implies that each node is aware of the DLB parameters of
all possible classes. This architectural alternative is acceptable only in the
presence of a small number of traffic classes (e.g., a class could be IP telephony),
even if it is still compliant with the DiffServ approach. An advantage of this
architecture is that it allows implementing procedures to fairly divide the overall
capacity among the traffic classes.



In the second alternative (see Fig. 5b), we assume that Probing packets belonging
to different classes are handled separately, with multiple probe queues. These
packets are recognized by routers by assigning them different DSCPs, while the
Information packets of all the I traffic classes are multiplexed together in a common
queue. Each class has a DSCP for Probing packets, while the Information packets of
all classes share the same DSCP. Thus, we need I + 1 different DSCPs and I + 1
different logical queues. As above, we do not need to signal explicit information
about the traffic mix composition. This duty is assigned to the measuring module,
which operates on traffic aggregates only. However, the Decision Criterion has to be
suitably modified with respect to the homogeneous case, in order to take into account
the presence of different traffic profiles, while still guaranteeing performance.

In this alternative, it is no longer possible to estimate the admitted flows of each
class, $N_i$, as done above; thus, if we still want to guarantee performance, we are left
with a worst case approach. Our measurement procedure now evaluates the number
A(T) of bytes emitted within a window of size T by all traffic classes. To define a
GRIP allocation rule we have to evaluate the traffic mix that maximizes the overall
equivalent bandwidth

$$e_{TOT} = \sum_{i=1}^{I} N_i e_i \qquad (18)$$

under the constraints

$$\sum_{i=1}^{I} N_i \, a_i^{min} \le A(T) \quad \text{with} \quad a_i^{min} = r_{S,i} T - B_{TS,i}, \qquad 0 \le N_i \le K_i \qquad (19)$$



In other words, we have to find the values of $N_i$ ($1 \le i \le I$) under the constraint
that the admitted flows must be such that they emit A(T) bytes, when using a
minimum DLB emission profile (this is the meaning of the superscript “min” of the quantity
“a”).

The main difficulty in solving the problem in (18) is the need to find an integer
solution, since we are dealing with numbers of flows. We start by solving the
associated continuous problem (i.e. by assuming that $N_i$ is a continuous variable) and
then we apply a floor operator to the solution (which is a slightly conservative
operation), obtaining:

$$N_x = \left\lfloor \frac{A(T)}{a_x^{min}} \right\rfloor, \qquad N_i = 0 \ \text{for} \ i \ne x \qquad (20)$$

where x is the index of the class with the greatest value of the ratio $e_i / a_i^{min}$
($1 \le i \le I$).
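
The worst-case mix of (20) is easy to compute. The sketch below is illustrative only: the function names and the two example classes are hypothetical, and, like (20), it ignores the upper limits $K_i$ of the continuous problem.

```python
import math


def a_min(r_s, b_ts, window):
    """Minimum number of bytes a DLB-regulated flow emits in a window T."""
    return r_s * window - b_ts


def worst_case_mix(bytes_in_window, classes, window):
    """Return (x, N_x): the class maximizing e_i / a_i_min and its flow count.

    classes is a list of (e_i, r_s_i, b_ts_i) tuples, one per traffic class.
    """
    ratios = [e / a_min(r_s, b_ts, window) for e, r_s, b_ts in classes]
    x = max(range(len(classes)), key=ratios.__getitem__)
    _, r_s_x, b_ts_x = classes[x]
    n_x = math.floor(bytes_in_window / a_min(r_s_x, b_ts_x, window))
    return x, n_x


# Hypothetical classes: (equivalent bandwidth in bit/s, r_S in bytes/s, B_TS in bytes)
EXAMPLE_CLASSES = [(20480.0, 1740.8, 5300.0), (51200.0, 4000.0, 10000.0)]
```

All measured bytes are attributed to the single class that turns a byte into the largest amount of equivalent bandwidth, which is why the estimate is pessimistic whenever the real mix contains other classes.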

Because of the lack of any information about the composition of the traffic mix,
the measurement procedure reduces the heterogeneous case to a homogeneous one, in
which, for estimation purposes, only the traffic class endowed with the worst
estimation of resource utilization is considered as present in the mix. Such class is
labeled with the index “x”. The worst estimation results from assigning the greatest
equivalent bandwidth to a number of bits obtained by assuming the minimum DLB
emission profile. The GRIP allocation rule relevant to the i-th probing class becomes:

$$\frac{A(T)}{r_{S,x} T - B_{TS,x}} + \sum_{j=1}^{I} STACK_j + \frac{e_i}{e_x} \le K_x \quad \text{with} \quad STACK_j = \frac{e_j}{e_x} \left( 1 - \frac{t - t_j}{T} \right) \qquad (21)$$

and the ratio $e_i / e_x$ accounts for the new incoming flow. Note that this architecture
still allows implementing procedures to fairly divide the overall capacity among the
traffic classes. As regards performance, this architecture is simpler than the previous
one, but the price to pay is a potentially lower system efficiency.
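
A possible reading of the rule for the multiple probe queue architecture is sketched below. Several points are assumptions made for illustration: the stack is a sum of linearly decaying terms scaled by the ratio of equivalent bandwidths, the comparison is non-strict, the limit is that of the worst class x, and all names are hypothetical.

```python
def admit_probe_shared_info_queue(bytes_in_window, now, window, probe_class,
                                  e, x, a_min_x, k_x, admit_log):
    """Decide on a probe of class probe_class when data share one queue.

    e          equivalent bandwidths e_i, one per class
    x          index of the worst-case class
    a_min_x    r_S,x * T - B_TS,x for the worst-case class, bytes
    k_x        flow limit of the worst-case class
    admit_log  list of (t_j, class index) for recently admitted flows
    """
    estimated_flows = bytes_in_window / a_min_x
    stack = sum((e[c] / e[x]) * (1.0 - (now - t_j) / window)
                for t_j, c in admit_log if 0.0 <= now - t_j < window)
    new_flow = e[probe_class] / e[x]   # weight of the flow being set up
    return estimated_flows + stack + new_flow <= k_x
```

Since each probe queue identifies the class of the request, a light flow is charged only a fraction of a worst-case flow, while the measured traffic is still converted into worst-case flows.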

As a third and last alternative, we propose the architecture shown in Fig. 5c. Here
we assume that the Probing packets of all the I traffic classes are multiplexed together
in the same queue and that the Information packets of all the I traffic classes are
multiplexed together in another common queue. All the traffic classes share the same
pair of DSCP tags: one for Probing packets and one for Information packets. Thus we
need only two DSCPs. This time we have to adopt a worst case approach not only for
the measurement procedure but also for the admission rule. Recall that we use the
DLB characterization twice: first to estimate the number of admitted
flows, and then to decide whether a new setup request can be accepted. Since in
this alternative we cannot distinguish between probes belonging to different traffic
classes, we are forced to interpret each setup request as belonging to the class with the
greatest equivalent bandwidth. Let us denote by $e_{max}$ the maximum value of $e_i$ ($1 \le i \le I$).
The GRIP allocation rule relevant to the i-th probing class becomes:

$$\frac{A(T)}{r_{S,x} T - B_{TS,x}} + STACK + \frac{e_{max}}{e_x} \le K_x \quad \text{with} \quad STACK = \frac{e_{max}}{e_x} \left( 1 - \frac{t - t_i}{T} \right) \qquad (22)$$

Fig. 5. Heterogeneous case logical scheme, multiple and single probe queue implementations



The ratio $e_{max} / e_x$, where $e_{max}$ is the greatest equivalent bandwidth, accounts for the
new incoming flow and A(T) is the number of
bytes emitted within a window of size T by all traffic classes. The stack variable is
incremented by using a scaling factor based on the greatest equivalent bandwidth, to
take into account setting up flows in a conservative way. This last alternative is the
simplest one. As in the previous ones, we do not need to signal explicit information
about the traffic mix composition. In addition, this alternative implies that each node
has to know the DLB parameters of only one traffic class (the worst one) and the
value $e_{max}$ (assuming that the considered traffic classes do not vary).

The prices to pay are: i) a potentially lower system efficiency with respect to the
previous cases; ii) the impossibility of implementing procedures to fairly divide the
overall capacity among the traffic classes. We carried out a performance evaluation of
this scenario similar to the one presented in Section 4. For space limitations we report 
only the conclusions (for details see [3]). In the first alternative, obviously, we have 
no loss of efficiency with respect to the homogeneous case. The second alternative 
presents a loss of efficiency with respect to the homogeneous case, which depends on 
the "distance" between the traffic classes. This effect is even more evident in the third 
alternative. The conclusion, as it could be expected, is that it is convenient to jointly 
handle different classes when they are not too "different" from each other, i.e., when 
their DLB parameters are not too distant from each other. 
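
The resource cost of the three alternatives can be summarized in a few lines. The helper below is a hypothetical illustration that only restates the DSCP and queue counts given in this section.

```python
def required_resources(alternative, num_classes):
    """DSCPs and logical queues needed by each architectural alternative.

    1: separate Probing and Information queues per class (2 * I)
    2: one Probing queue per class plus a common Information queue (I + 1)
    3: one common Probing queue and one common Information queue (2)
    """
    if num_classes < 1:
        raise ValueError("at least one traffic class is required")
    if alternative == 1:
        count = 2 * num_classes
    elif alternative == 2:
        count = num_classes + 1
    elif alternative == 3:
        count = 2
    else:
        raise ValueError("alternative must be 1, 2 or 3")
    return {"dscps": count, "queues": count}
```

With a single class all three alternatives coincide, which is consistent with the homogeneous case being their common starting point.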



6 Conclusions 

In this paper we have presented a scalable Admission Control scheme, called GRIP.
This is a novel reservation paradigm that allows an evolution from the current best-effort
Internet to a future QoS capable infrastructure. In conformance with the DiffServ
principles, GRIP does not rely on explicit signaling protocols to provide an admission
control function. Such a function is achieved by requiring each router to be capable of
distinguishing probe packets from data packets and of properly enforcing a suitable
scheduling discipline. More specifically, we have proposed procedures that allow a
tight control on the QoS experienced by admitted flows. A stack mechanism has been
proposed to avoid temporary overallocation in the presence of impulsive loads.
Simulation results and analytical performance evaluation have been provided in order
to allow a thorough dimensioning of the system.



Abstract. Endpoint admission control is a mechanism for achieving scalable services
by pushing quality-of-service functionality to end hosts. In particular, hosts
probe the network for available service and are admitted or rejected by the host 
itself according to the performance of the probes. While particular algorithms 
have been successfully developed to provide a single service, a fundamental re- 
source stealing problem is encountered in multi-class systems. In particular, if 
the core network provides even rudimentary differentiation in packet forwarding 
(such as multiple priority levels in a strict priority scheduler), probing flows may 
infer that the quality-of-service in their own priority level is satisfactory, but may 
inadvertently and adversely affect the performance of other classes, stealing re- 
sources and forcing them into quality-of-service violations. This issue is closely 
linked to the network scheduler as the performance isolation property provided by 
multi-class schedulers also introduces limits on observability, or a flow’s ability 
to assess its impact on other traffic classes. In this paper, we study the problem 
of resource stealing in multi-class networks with end-point probing. For this scal- 
able architecture, we describe the challenge of simultaneously achieving multiple 
service levels, high utilization, and a strong service model without stealing. We 
propose a probing algorithm termed ε-probing which enables observation of other
traffic classes’ performance with minimal additional overhead. We next develop a 
simple but illustrative Markov model to characterize the behavior of a number of 
schedulers and network elements, including flow-based fair queueing, class-based 
weighted fair queueing and rate limiters. Finally, we perform an extensive set of 
simulation experiments to study the performance tradeoffs of such architectures, 
and to evaluate the effectiveness of ε-probing.



1 Introduction 

The Integrated Services (IntServ) architecture of the IETF provides a mechanism for 
supporting quality-of-service for real-time flows. Two important components of this 
architecture are admission control tw\ and signaling |14)|: the former ensures that 
sufficient network resources are available for each new flow, and the latter communicates 
such resource demands to each router along the flow’s path. However, the demand for 
high-speed core routers to process per-flow reservation requests introduces scalability 
and deployability limitations of this architecture without further enhancements.

In contrast, the Differentiated Services (DiffServ) architecture [EH achieves scalability
by limiting quality-of-service functionalities to class-based priority mechanisms
together with service level agreements. However, without per-flow admission control, 
such an approach necessarily weakens the service model as compared to IntServ, namely 
individual flows are not assured of a bandwidth or loss guarantee. 

A key challenge addressed in recent research is how to simultaneously achieve the 
scalability of DiffServ and the strong service model of IntServ. Towards this end, several 
novel architectures and algorithms have been proposed. For example, architectures for 
scalable deterministic services were developed in JI3I3. In o. a technique termed 
Dynamic Packet State is developed in which inserting state information into packet 
headers overcomes the need for per-flow signaling and state management. In IH3, a 
bandwidth broker is employed to manage deterministic services without explicit co- 
ordination among core nodes. A scheme that provides scalable statistical services is 
developed in H, whereby only a flow’s egress node performs admission control via 
continuous passive monitoring of the available service on a path. 

While such approaches are able to achieve scalability and strong service models, they 
do so while requiring specific functionality to be employed at edge and/or core nodes. 
For example, m requires packet time stamping and egress nodes to process signaling 
messages; o requires rate monitoring and state packet insertion at ingress points and 
special schedulers at core nodes. Thus, despite that such edge/core router modifications 
may indeed be feasible, an alternate and equally compelling problem is to ask whether the 
same goals can be achieved without any changes to core or edge routers, or at most with 
routers providing simplistic prioritized forwarding as envisioned by DiffServ extensions 
such as class based queueing or prioritized dropping policies. 

This design constraint is quite severe: it precludes use of a signaling protocol as 
well as any special packet processing within core nodes. Such a constraint naturally 
leads to probing schemes in which end hosts perform admission control by assessing 
the state of the network by transmitting a sequence of probe packets and measuring 
the corresponding performance. If the performance (e.g., loss rate) of the probes is 
acceptable, the flow is admitted, otherwise it is rejected. Design and analysis of several 
such schemes can be found in itnaTi . Such approaches achieve scalability by pushing 
quality-of-service functionalities to the end system and indeed removing the need for a 
signaling protocol or any special-purpose edge or core router functions. Moreover, Q 
found that such an architecture is indeed able to provide a single controlled-load like 
service as defined in uni- 

However, can host-controlled probing schemes be generalized to support multiple 
service classes as achieved by both IntServ and DiffServ? In particular, DiffServ supports 
multiple service classes differentiated by simple aggregate scheduling policies (per-hop 
behaviors); DiffServ’s Service Level Agreements (SLAs) provide aggregate bandwidth 
guarantees to traffic classes; IntServ provides mechanisms to associate different quality- 
of-service parameters (e.g., loss rate, bandwidth, and delay) with different traffic classes. 
Can such multi-class service models co-exist with the host-controlled architecture? 

Unfortunately, a resource stealing problem, first described in Q, can occur in multi- 
class systems. In particular, the problem occurs when a user probes within its desired 
class and, upon obtaining no loss (or loss below the class’ threshold), infers that sufficient
capacity is available, which indeed it may be within the class. However, in some cases,
admission of the new probing flow would force other classes into a situation of
quality-of-service violations, unbeknownst to the probing flow. Such resource stealing, described in
detail in Section 2, arises from a fundamental observability issue in a multi-class system:
the performance isolation property provided by multi-class networks also inhibits flows 
from assessing their performance impact on other classes. 

The goal of this paper is to investigate host probing in multi-class networks. Ad- 
dressing the problem of resource stealing, our contribution is threefold. First, we study 
architectural issues and show how service disciplines and the work conservation property 
have important roles in the performance of probing systems. For example, while a non- 
work-conserving service discipline can prohibit resource borrowing across classes and 
remove the stealing problem, such rigid partitioning of system resources limits resource 
utilization. Second, we develop a probing algorithm which simultaneously achieves high 
utilization and a strong service model without stealing. The algorithm, termed ε-probing,
provides a minimally invasive mechanism to enable flows to assess their impact on other 
traffic classes in Class-Based Fair Queueing (CBQ) and strict priority systems. Finally, 
we introduce a simple but illustrative analytical model based on Markov Chains. Using 
the model, we precisely identify stealing states, comparatively analyze several probing 
architectures, and quantify the aforementioned tradeoffs. 

In all cases, we use an extensive set of simulations to evaluate different probing
schemes and architectures under a wide range of scenarios and traffic types. The experimental
results indicate that ε-probing can achieve utilizations close to the limits obtained
by fair queueing, while eliminating resource stealing. Consequently, if core networks
provide minimal differentiation on the forwarding path, ε-probing provides a scalable
mechanism to control multiple service classes, achieve high utilization, and provide a
strong service model without resource stealing.

The remainder of this paper is organized as follows. In Section 2 we formulate the
stealing problem in multi-class networks and describe the role of the packet scheduler.
Next, in Section 3 we propose a simple probing algorithm, termed ε-probing, that
overcomes the observability limitations introduced by multi-class schedulers. In Section
4 we develop an analytical model to study the performance issues and tradeoffs in
achieving high utilization, multiple service classes, and a strong service model without
stealing. Finally, in Section 5 we describe an extensive set of simulation experiments
used to investigate the design space under more realistic scenarios.

2 Resource Stealing 

The stealing problem arises in multi-class systems in which resources are remotely con- 
trolled by observation. This is in contrast to systems in which resources are controlled 
with explicit knowledge of their load, such as in IntServ-like architectures. In this sec- 
tion, we describe the origins of multi-class stealing and the corresponding design and 
performance issues. Throughout, we consider a general definition of “class” that can be 
based on application types, service level agreements, etc., and with quality-of-service 
parameters such as loss rate and delay associated with each class. 
2.1 Origins and Illustration 

Probing schemes, such as those studied in prior work, can be described with an example 
using the network depicted in Figure 1. To establish a real-time flow between hosts H and 
H’, host H transmits a sequence of probes into the network at the desired rate (or peak rate 
for variable rate flows). If the loss rate of the probes is below a pre-established threshold 
for the traffic class, then the flow is admitted; otherwise it is rejected. Scalability 
is achieved in such a framework by pushing all quality-of-service functionality to the end 
hosts, thereby removing the need for any signaling or storage of per-flow state. 

The stealing problem can be illustrated as follows. Consider a simple 
scenario with two flows sharing a single router with link capacity $C$ and a flow-based fair 
queueing scheduler¹ (or similarly, core stateless fair queueing to achieve scalability 
on the data path). Suppose that the first flow requires a bandwidth of $\frac{3}{4}C$ and is admitted to 
an initially idle system. Further suppose that the second flow has a bandwidth requirement 
of $\frac{1}{2}C$. Upon probing for the available service in the fair queueing system, the flow will 
discover that it can indeed achieve a loss-free service with throughput $\frac{1}{2}C$, and admit 
itself. Unfortunately, while $\frac{1}{2}C$ is indeed the fair rate for each flow, the goal here is 
not to achieve packet-level fairness, but rather to achieve flow-level quality-of-service 
objectives. Thus, in this example, abruptly reducing the first flow’s capacity is a clear 
violation of the flow’s service. 
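
The two-flow example can be checked numerically. The following is a minimal sketch of a fluid, flow-based fair queueing link (max-min fair sharing), not the simulator used in this paper. The function name `max_min_fair`, the normalized capacity C = 1, and the first flow's demand of 3C/4 are illustrative assumptions.

```python
def max_min_fair(demands, capacity):
    """Fluid max-min fair allocation of `capacity` among flows with the given demands."""
    alloc = [0.0] * len(demands)
    remaining = float(capacity)
    # Serve flows in order of increasing demand.
    active = sorted(range(len(demands)), key=lambda i: demands[i])
    while active:
        share = remaining / len(active)
        i = active[0]
        if demands[i] <= share:
            # The smallest remaining demand fits in an equal share: serve it fully.
            alloc[i] = demands[i]
            remaining -= demands[i]
            active.pop(0)
        else:
            # Every remaining flow wants more than the equal share: split evenly.
            for j in active:
                alloc[j] = share
            break
    return alloc


C = 1.0                                          # normalized link capacity (assumption)
before = max_min_fair([0.75 * C], C)             # first flow alone on the link
after = max_min_fair([0.75 * C, 0.5 * C], C)     # second flow admits itself
print("before:", before, "after:", after)
```

In this sketch the second flow sees a loss-free rate of C/2, while the rate of the first flow drops from 3C/4 to C/2, which is the stealing effect described above.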




Fig. 1. Illustration of Probing and Multi-Class Stealing 



This simple example illustrates an important point. The ability of fair queueing to 
provide performance isolation can be exploited for both flow-control (to quickly and 
accurately assess a flow’s fair rate) and quality of service (to provide a minimum guaranteed 
bandwidth to a flow or group of flows). However, it is precisely this performance 
isolation which introduces the “stealing” problem for scalable services: since the prob- 
ing flow is isolated from the established flows, it cannot assess the potentially significant 
performance impact that it has on them. Consequently, while a new flow can determine 
whether or not its own quality-of-service objectives will be satisfied, it cannot determine 

¹ We will discuss both flow- and class-based fair queueing. In class-based fair queueing, the 
scheduling discipline inside each class is FCFS (First Come First Served), and between classes 
the discipline is fair queueing. In flow-based fair queueing, each flow is considered as a class. 
its impact on others. Thus, if admitted, the new flow can unknowingly steal resources 
from previously admitted flows and result in service violations. 

This problem is not limited to fair queueing nor to per-flow schedulers. Consider a 
class-based strict priority scheduler in which a new flow wishes to probe for the available 
service at a mid-level priority. Ideally, the flow could indeed assess the capacity remaining 
from higher priority flows by probing at the desired service level. However, it would not 
be able to assess its impact on lower priority levels without also probing at lower levels. 

Thus, the stealing problem arises from a lack of observability of multi-class networks, 
namely, that assessing one’s own performance does not necessarily ensure that other 
flows are not adversely affected. 

2.2 Problem Formulation 

Within a framework of scalable services based on host probing, the key challenge is 
to simultaneously achieve (1) multiple traffic classes (differentiated services), (2) high 
utilization, and (3) a strong service model without stealing. To illustrate this challenge, 
consider the network of Figure 1 in which each link has capacity C. Further suppose 
the system supports two traffic classes A and B with different traffic characteristics and 
QoS requirements. 

A key design axis which affects these design goals is whether or not the system 
allows resource sharing across classes. This in turn is controlled by the scheduler and 
whether or not it is work conserving. 

Rigid Partitioning without Work Conservation One way to ensure both classes 
achieve their desired QoS constraints is via hard partitioning of system resources with no 
bandwidth borrowing across classes allowed. Such a system can be implemented with 
rate limiters, i.e., policing elements with the peak rate of class $i$ limited to $\phi_i C$ with 
$\phi_1 + \phi_2 \le 1$. 

Observe that a hard partitioning system can support multiple traffic classes and does 
not incur stealing, thereby achieving the first and third goal above. However, notice that 
the system is non-work-conserving in the sense that it will reject flows even if sufficient 
capacity is available and consequently can underutilize system resources. For example, 
suppose path A-A’ of Figure 1 has a large class-A demand and no class-B demand, and vice versa 
on path B-B’. In this case, the system would be underutilized, as only half of the flows 
which the system could support would be admitted. In general, whenever the current 
bandwidth demands are not in line with the weights $\phi_i$, the system will suffer from low 
utilization. 

Inter-class Sharing with Work Conservation In contrast to the scenario above, consider 
a work-conserving system which allows one class to use excess capacity from other 
classes. In particular, consider a two-class fair queueing system (without rate limiters) 
with weights $\phi_1$ and $\phi_2$. With the same demand as in the example above, both A-A’ and 
B-B’ flows can fully utilize the capacity due to the soft partitioning of resources in the 
work-conserving system. Thus, the first and second goals are achieved. However, as described 
in Section 2.1, such a system suffers from the stealing problem, as a new class-B 
flow on A-A’ or a new class-A flow on B-B’ will steal bandwidth from established flows. 
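
The utilization gap between the two partitioning styles can be illustrated with a small fluid calculation. This is only a sketch under stated assumptions: equal weights of 1/2, a normalized capacity C = 1, and a class-A demand on path A-A’ that equals the link capacity. The helper `admitted_load` is hypothetical and deliberately ignores stealing, which the work-conserving branch does not detect.

```python
def admitted_load(demand, weights, capacity, work_conserving):
    """Per-class bandwidth admitted on one link under greedy admission (fluid sketch)."""
    if not work_conserving:
        # Hard partitioning: class i can never exceed its rate limit phi_i * C.
        return [min(d, w * capacity) for d, w in zip(demand, weights)]
    # Soft partitioning: a class may borrow capacity that other classes leave idle.
    admitted, free = [], capacity
    for d in demand:
        a = min(d, free)
        admitted.append(a)
        free -= a
    return admitted


C = 1.0
demand_on_AA = [1.0, 0.0]   # assumed: class-A demand fills the link, no class-B demand
hard = admitted_load(demand_on_AA, [0.5, 0.5], C, work_conserving=False)
soft = admitted_load(demand_on_AA, [0.5, 0.5], C, work_conserving=True)
print("utilization, rate limited:", sum(hard) / C)
print("utilization, work conserving:", sum(soft) / C)
```

With these assumed inputs the rate-limited link carries half of the load that the work-conserving link carries, matching the example in the text.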
Targeted Behavior The targeted behavior that we strive to achieve is to combine the 
advantages of the hard and soft partitioning systems and allow borrowing across classes 
to achieve high utilization, while eliminating resource stealing to provide a strong service 
model. Thus, in the example, if A-A’ is fully utilized by class-A flows, class-B flows 
(and class-A flows) should be blocked until class-A flows depart. Below, we develop 
new probing schemes which seek to simultaneously achieve the above three design goals 
and achieve this targeted behavior. This service model is a greedy one, in which every flow 
that can be admitted is admitted, provided that its own and all other flows’ service 
requirements can be satisfied. This strategy does not incorporate blocking probability as a 
QoS parameter. It is possible to target blocking probabilities, but this is beyond the scope 
of this paper. Throughout the paper we consider only admission-controlled traffic. Best-effort 
traffic would have a lower priority level, so it would not interfere with the admission-controlled 
traffic. Also, a guaranteed-like service with strict QoS assurances would have reserved 
bandwidth and a higher priority. 

3 Epsilon Probing 

In this section, we develop probing algorithms which overcome the stealing problem in 
fair queueing multi-class servers. The key technique is to infer the “state” of other classes 
with minimal overhead in terms of probing traffic or probing duration. Throughout, we 
consider a simplified bufferless fluid model, in which flows and probes transmit 
at constant rate and probing is “perfect” in the sense that probes correctly infer their loss 
rate as determined by the scheduling discipline (which defines how loss is distributed 
among classes) and the workload (which defines the extent of the loss in the system). 



Fig. 2. Illustration of ε-Probing 



Consider a class-based weighted fair queueing server with $K$ classes, where class 
$k$ has weight $\phi_k$ and target QoS parameters of loss rate $P_k$ and delay bound $d_k$ (for 
simplicity, we restrict the discussion to loss). According to the definition of WFQ, 
the bandwidth utilized by class $k$ when all classes are backlogged is given by 
$U_k = \frac{\phi_k}{\sum_{i \in S} \phi_i} C$, where $S$ is the set of backlogged classes. Let the demanded bandwidth of 



Resource Stealing in Endpoint Controlled Multi-class Networks 






class $k$ be denoted by $B_k$. If $B_1 + \cdots + B_K > C$, then loss occurs in the system, and 
the loss rate of class $k$ at that particular instant in time is given by 

$$l_k = \frac{(B_k - U_k)^+}{B_k}. \qquad (1)$$

With probing in a single level, a new class-$k$ flow requesting bandwidth $b_k$ is admitted 
if its measured loss rate is less than the class requirement, i.e., if $l_k < P_k$. However, 
observe that even under congestion and arbitrarily high loss rates in other classes, the 
new flow would be admitted as long as 

$$B_k + b_k < \frac{U_k}{1 - P_k}. \qquad (2)$$

Thus, stealing across classes can occur as the probing flow fails to observe whether or 
not other classes’ loss requirements are also satisfied. While simultaneously probing in 
all classes may seemingly solve the problem, it is not only unnecessary, but also significantly 
damages the performance of the system: increased probing traffic forces the 
system to more quickly enter a thrashing regime in which excessive probing traffic 
causes flows to be mistakenly rejected, and in the limit, causes system collapse. 

We propose ε-probing as a probing scheme designed to eliminate stealing in a minimally 
invasive way. With ε-probing, a new flow requesting bandwidth $b_k$ simultaneously 
transmits a small bandwidth $\epsilon_i$ to each other class $i$. The motivating design principle is 
that the impact of the new flow on all classes must be observed, so that the new flow 
is only admitted if $l_i < P_i$ is satisfied for all $i = 1, \ldots, K$. The admissible loss rate 
in each ε-probe is the same for all classes and globally agreed upon. In particular, 
addition of the new class-$k$ flow can affect $U_i$ for each class $i$: the ε-probes ensure that 
the new $U_i$ is sufficiently large to meet the required loss rate. 
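
The difference between single-level probing and ε-probing can be sketched in the bufferless fluid model. The code below is an illustration under stated assumptions, not the implementation evaluated in this paper: the service of each class is computed by weighted max-min sharing (so a class may borrow idle capacity), probing is "perfect", and the names `wfq_service`, `loss_rates` and `admit`, as well as the numerical values, are hypothetical.

```python
def wfq_service(demand, weights, capacity):
    """Fluid WFQ service rates: weighted max-min sharing of the link capacity."""
    service = [0.0] * len(demand)
    backlogged = set(range(len(demand)))
    remaining = float(capacity)
    while backlogged:
        total_w = sum(weights[i] for i in backlogged)
        # Classes whose demand fits within their weighted share are fully served.
        satisfied = [i for i in backlogged
                     if demand[i] <= weights[i] * remaining / total_w]
        if not satisfied:
            # All remaining classes are overloaded: split by weight.
            for i in backlogged:
                service[i] = weights[i] * remaining / total_w
            break
        for i in satisfied:
            service[i] = demand[i]
            remaining -= demand[i]
            backlogged.discard(i)
    return service


def loss_rates(demand, weights, capacity):
    """Per-class loss rate (demand - service)^+ / demand, in the spirit of Eq. (1)."""
    service = wfq_service(demand, weights, capacity)
    return [max(d - s, 0.0) / d if d > 0 else 0.0
            for d, s in zip(demand, service)]


def admit(demand, weights, capacity, k, b_k, target, eps=0.0):
    """Probe class k at rate b_k and every other class at rate eps.

    eps = 0 corresponds to single-level probing (only class k is observed);
    eps > 0 is the epsilon-probing sketch (all classes are observed).
    """
    probe = [d + (b_k if i == k else eps) for i, d in enumerate(demand)]
    loss = loss_rates(probe, weights, capacity)
    probed = [k] if eps == 0.0 else range(len(demand))
    return all(loss[i] < target[i] for i in probed)


C = 6.0
weights = [1, 2]            # weights 1/3 and 2/3 after normalization
target = [0.01, 0.01]       # assumed target loss rates
established = [5.0, 0.0]    # class 1 has borrowed idle class-2 capacity

single_level = admit(established, weights, C, k=1, b_k=2.0, target=target)
eps_probing = admit(established, weights, C, k=1, b_k=2.0, target=target, eps=0.064)
light_load = admit([3.0, 0.0], weights, C, k=1, b_k=2.0, target=target, eps=0.064)
print(single_level, eps_probing, light_load)
```

Under these assumptions the single-level probe admits the new class-2 flow even though class 1 would then lose 20% of its traffic, whereas the ε-probe in class 1 observes that loss and the flow is rejected. When the link has spare capacity, ε-probing still admits the flow.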

In the fluid model, $\epsilon_i$ can be arbitrarily small, whereas in the packet system, it must 
be sufficiently large to detect loss in the class. In the simulation experiments of Section 5 
we consider $\epsilon_i$ = 64 kb/sec for a 45 Mb/sec link with flows transmitting at rates between 
512 kb/sec and 2 Mb/sec. 

Finally, we note that despite the utilization advantages of a work-conserving system, 
a network may still contain non-work-conserving elements to achieve other objectives 
(e.g., to ensure that a minimum bandwidth is always available in each class, even if 
there is no current demand, cf. Figure 2). The goal of ε-probing is to enable inter-class 
resource sharing to the maximal extent allowed by the system architecture. 

ε-probing is applicable to both class-based fair queueing and strict priority schedulers. 
In the latter type of scheduler, the ε-probes are required only in the priority levels 
lower than the level of the class for the flow that is requesting admission. In higher 
levels stealing cannot occur and no ε-probe is required. Therefore, the overhead due to 
ε-probes is lower in strict priority than in class-based fair queueing schedulers. 
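
The rule for which classes receive an ε-probe can be written down directly. The helper below is a hypothetical sketch; it assumes that classes are indexed from 0 and that, under strict priority, a smaller index means a higher priority.

```python
def classes_to_epsilon_probe(k, num_classes, scheduler):
    """Classes that receive an epsilon-probe when a class-k flow requests admission."""
    if scheduler == "strict_priority":
        # Only lower priority levels can have bandwidth stolen by class k.
        return list(range(k + 1, num_classes))
    if scheduler == "class_based_fq":
        # Any other class can be affected, so all of them are probed.
        return [i for i in range(num_classes) if i != k]
    raise ValueError("unknown scheduler: " + scheduler)


print(classes_to_epsilon_probe(1, 4, "strict_priority"))
print(classes_to_epsilon_probe(1, 4, "class_based_fq"))
```

The shorter list under strict priority reflects the lower probing overhead noted above.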



4 Theoretical Model 

In this section we develop an analytical model based on continuous time Markov chains 
to study the problem of resource stealing in multi-class networks. 
4.1 Preliminaries 

In the model, each state identifies the number of currently admitted flows in each class 
such that with $K$ classes, the Markov chain has $K$ dimensions. The link capacity is 
$C$ resource units and each flow of class $k$ occupies $b_k$ resource units.² We assume 
that new flows of class $k$ arrive to the system as a Poisson process with mean inter-arrival time 
$1/\lambda_k$ and that flow lifetimes are exponentially distributed with mean $1/\mu_k$. Probing is 
considered instantaneous so that we do not consider the “thrashing” phenomenon (due 
to simultaneously probing flows) discussed in Section 3. 

We define $n_k$ to be the number of class-$k$ flows in the system such that the total 
amount of resources occupied by all flows in the system is given by $(b \cdot n)$, where 
$b = (b_1, \ldots, b_K)$, $n = (n_1, \ldots, n_K)$, and $(b \cdot n) = \sum_{k=1}^{K} b_k n_k$. All classes require zero 
loss probability so that we restrict our focus to multi-class stealing and do not address 
QoS differentiation with this model. 

In the discussion below, we consider an example consisting of two traffic classes so 
that, for example, the transition from state (0, 1) to (1, 1) signifies admission of the first 
class 1 flow. The link capacity $C$ is 6 resource units and the flow bandwidths are 1 and 
2, i.e., $b_1 = 1$ and $b_2 = 2$. 

4.2 Markov Models 

Below, we model the different schedulers and probing algorithms which we compare in 
Section 4.3. 



FIFO Here, we consider FIFO as a baseline scenario. In our working example with two 
classes, each flow will probe the network and will be admitted only if $b_1 n_1 + b_2 n_2 \le C$, 
including the probing flow. Figure 3 depicts the corresponding state transition diagram. 

In the general case, the state space is $S = \{n \in I^K : (b \cdot n) \le C\}$, where $I$ is the set 
of non-negative integers and $I^K$ is the set of all $K$-tuples of non-negative integers. The 
link utilization is given by $u = \frac{1}{C} \sum_{n \in S} (b \cdot n)\,\pi(n)$, where $\pi(n)$ is the probability of 
being in state $n$, which can be computed using standard techniques. Notice that the 
probability of stealing is zero, since probing flows are only admitted if there is available 
bandwidth in the link. 
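
As one possible "standard technique", the FIFO chain can be solved with the well-known product-form solution of a multi-rate loss system with complete sharing. The sketch below assumes that this product form applies to the model as described, and takes the offered loads $\rho_k = \lambda_k / \mu_k$ from our reading of the parameters listed in Section 4.3. The function names are hypothetical, and the printed utilization is a property of this sketch only; it is not claimed to reproduce Table 1.

```python
from itertools import product
from math import factorial


def fifo_states(b, C):
    """All states n with (b . n) <= C."""
    ranges = [range(C // bk + 1) for bk in b]
    return [n for n in product(*ranges)
            if sum(bk * nk for bk, nk in zip(b, n)) <= C]


def product_form(states, rho):
    """Stationary probabilities pi(n) proportional to prod_k rho_k^n_k / n_k!."""
    weight = {}
    for n in states:
        w = 1.0
        for nk, rk in zip(n, rho):
            w *= rk ** nk / factorial(nk)
        weight[n] = w
    norm = sum(weight.values())
    return {n: w / norm for n, w in weight.items()}


def utilization(pi, b, C):
    """u = (1/C) * sum over states of (b . n) * pi(n)."""
    return sum(sum(bk * nk for bk, nk in zip(b, n)) * p
               for n, p in pi.items()) / C


b, C = (1, 2), 6
rho = (8 * 2 / 3, 5 * 1 / 4)   # assumed offered loads lambda_k / mu_k
states = fifo_states(b, C)
pi = product_form(states, rho)
u_fifo = utilization(pi, b, C)
print(len(states), "states, utilization", round(u_fifo, 3))
```

Since no state exceeds the link capacity, every state in this space is free of stealing, in line with the observation above.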



Flow-Based Fair Queueing As described in Section 2, larger bandwidth flows can have 
bandwidth stolen in flow-based fair queueing systems. Figure 4 depicts the system’s 
state transition diagram for our working example. As shown, the state space includes all 
states in which the number of flows multiplied by the lowest flow bandwidth is less than or 
equal to the link capacity. For example, a transition from state (4,1) to (5,1) is possible 
because 1 bandwidth unit is guaranteed for each flow, and this is sufficient for class 1 
flows. Alternatively, a transition from state (5,0) to (5,1) is not possible because class 
2 flows require 2 bandwidth units. Thus, the stealing states represent admissions of 

² In general, the rates of flows that belong to the same class may be different; however, the 
assumption of equal rates is required for the Markov chain formulation. 
Fig. 3. FIFO Transition Diagram 



low bandwidth flows when the system is at full capacity and high bandwidth flows are 
forced into a loss state. In the state transition diagrams, stealing states are represented 
by a crossed-out state. 




Fig. 4. Flow-Based Fair Queueing Transition Diagram 



Suppose $\{n_1, n_2, \ldots, n_K\}$ is the current admitted set of flows. Then a new 
class-$k$ flow is admissible if $b_1 n_1 + b_2 n_2 + \cdots + b_k([n_k + 1] + n_{k+1} + \cdots + n_K) \le C$, 
with $b_1 < b_2 < \cdots < b_k < \cdots < b_K$. There are two cases: if the total demand satisfies 
$b_1 n_1 + b_2 n_2 + \cdots + b_k(n_k + 1) + b_{k+1} n_{k+1} + \cdots + b_K n_K \le C$, then all flows are correctly 
admissible; however, if $b_1 n_1 + b_2 n_2 + \cdots + b_k(n_k + 1) + b_{k+1} n_{k+1} + \cdots + b_K n_K > C$, 
then stealing occurs. Since the scheduler fairly allocates bandwidth to all flows, flows 
with bandwidth higher than the bandwidth of the flow requesting admission are forced 
into a loss state. 
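
The admissibility rule and the two cases can be sketched as follows. The function names are hypothetical, classes are indexed from 0, and the bandwidths `b` are assumed to be sorted in increasing order as in the text.

```python
def fq_admissible(n, k, b, C):
    """Admission test seen by a probing class-k flow under flow-based fair queueing.

    Flows of smaller classes count with their own bandwidth; the probing flow,
    the other class-k flows and all larger flows each count with b[k].
    """
    lower = sum(b[i] * n[i] for i in range(k))
    at_least_k = (n[k] + 1) + sum(n[i] for i in range(k + 1, len(n)))
    return lower + b[k] * at_least_k <= C


def admission_steals(n, k, b, C):
    """True if the flow is admissible although the total demand then exceeds C."""
    total = sum(bi * ni for bi, ni in zip(b, n)) + b[k]
    return fq_admissible(n, k, b, C) and total > C


b, C = (1, 2), 6
print(fq_admissible((4, 1), 0, b, C))     # (4,1) -> (5,1)
print(fq_admissible((5, 0), 1, b, C))     # (5,0) -> (5,1)
print(admission_steals((4, 1), 0, b, C))  # admitting the class 1 flow steals
```

For the working example this reproduces the transitions discussed above: the class 1 flow is admitted from (4,1) and forces the class 2 flow into loss, while the class 2 flow is blocked in (5,0).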
For the case of two types of flows, the state space including stealing states is given 
by 

$$S_{FQ} = \{n \in I^2 : ((b_1 n_1 + b_1 n_2 \le C) \wedge (b_2 n_2 \le C) \wedge (n_1 \ne 0)) \vee ((b_2 n_2 \le C) \wedge (n_1 = 0))\}.$$

For example, in our transition diagram, state (2, 3) (with $n_1 = 2 \ne 0$) is a possible 
state (a stealing state) because $b_1 n_1 + b_1 n_2 = 5 < C$ and $b_2 n_2 = 6 \le C$. State (2, 4) 
is not a possible state since $b_1 n_1 + b_1 n_2 = 6 \le C$ but $b_2 n_2 = 8 > C$. This example 
explains the need for the second inequality ($b_2 n_2 \le C$). With $n_1 = 0$, the state with the 
maximum number of flows is (0, 3) because $b_2 n_2 = 6 \le C$. 
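
The two-class state space can be enumerated to check the examples. The sketch below assumes the membership condition as stated for the working example ($b_1 = 1$, $b_2 = 2$, $C = 6$) and labels as stealing states those whose total demand exceeds the capacity. The function name is hypothetical.

```python
def in_fq_state_space(n1, n2, b1, b2, C):
    """Membership test for the two-class flow-based fair queueing state space."""
    if n1 != 0:
        return b1 * n1 + b1 * n2 <= C and b2 * n2 <= C
    return b2 * n2 <= C


b1, b2, C = 1, 2, 6
states = [(n1, n2)
          for n1 in range(C // b1 + 1)
          for n2 in range(C // b2 + 1)
          if in_fq_state_space(n1, n2, b1, b2, C)]
# Stealing states: the total demand exceeds the link capacity.
stealing = [(n1, n2) for n1, n2 in states if b1 * n1 + b2 * n2 > C]
print(len(states), "states,", len(stealing), "stealing states")
```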

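To generalize the state space to $K$ types of flows, let $k^*$ be the smallest $k$ such that 
$n_{k^*} \ne 0$; then 

$$S_{FQ} = \{n \in I^K : (b_{k^*} n_{k^*} + b_{k^*} n_{k^*+1} + \cdots + b_{k^*} n_K \le C) \wedge (b_{k^*+1} n_{k^*+1} + \cdots + b_{k^*+1} n_K \le C) \wedge \cdots \wedge (b_K n_K \le C)\}.$$

The mean utilization is then 

$$u = \frac{1}{C} \sum_{n \in S_{FQ}} (b \circ n)\,\pi(n), \qquad (3)$$

where 

$$(b \circ n) = \begin{cases} (b \cdot n) & \text{if } (b \cdot n) \le C \\ C & \text{otherwise.} \end{cases} \qquad (4)$$

The probability of stealing is 

$$P_{steal}^{FQ} = 1 - \frac{\sum_{n \in S_{FQ}} (b \circ n)\,\pi(n)}{\sum_{n \in S_{FQ}} (b \cdot n)\,\pi(n)}, \qquad (5)$$

computed as the percentage of bandwidth guaranteed to flows which is stolen by other 
flows. 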



Rate Limiters With rate limiters, class $k$ flows are only allowed to use a maximum 
of $C_k$ bandwidth units. Here, we consider rate limiters of $C_1 = 2$ units and $C_2 = 4$ 
units respectively. Figure 5 depicts the corresponding state transition diagram. Given the 
functionality of the rate limiters, the state space is reduced to 

$$S_{RL} = \{n \in I^K : b_k n_k \le C_k,\ k = 1, 2, \ldots, K\}. \qquad (6)$$

With this elimination of various high-utilization states, the overall system utilization in 
the general case of $K$ classes, given by $u = \frac{1}{C} \sum_{n \in S_{RL}} (b \cdot n)\,\pi(n)$, is reduced 
as compared to work-conserving systems. Clearly, the extent of this utilization reduction 
is a function of the system load, $b_k$ and $\lambda_k$, and $C_k$. If they are properly tuned, the penalty 
will be minimal, whereas if they become unbalanced due to load fluctuations, the system 
performance will suffer. 
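
The utilization reduction caused by the smaller state space can be illustrated by evaluating the same stationary weights over both state spaces. This is a sketch under assumptions: a product-form stationary distribution is assumed for both chains, the offered loads are our reading of the parameters in Section 4.3, and the helper names are hypothetical. The printed values are not claimed to reproduce Table 1.

```python
from itertools import product
from math import factorial


def utilization(states, b, rho, C):
    """(1/C) * sum of (b . n) * pi(n), with pi(n) proportional to prod rho_k^n_k / n_k!."""
    weight = {}
    for n in states:
        w = 1.0
        for nk, rk in zip(n, rho):
            w *= rk ** nk / factorial(nk)
        weight[n] = w
    norm = sum(weight.values())
    carried = sum(sum(bk * nk for bk, nk in zip(b, n)) * w
                  for n, w in weight.items())
    return carried / norm / C


b, C = (1, 2), 6
limits = (2, 4)                 # rate limits C_1 and C_2 of the working example
rho = (8 * 2 / 3, 5 * 1 / 4)    # assumed offered loads lambda_k / mu_k

grid = list(product(range(C // b[0] + 1), range(C // b[1] + 1)))
fifo = [n for n in grid if b[0] * n[0] + b[1] * n[1] <= C]
rate_limited = [n for n in grid
                if b[0] * n[0] <= limits[0] and b[1] * n[1] <= limits[1]]

u_fifo = utilization(fifo, b, rho, C)
u_rl = utilization(rate_limited, b, rho, C)
print(round(u_rl, 3), "with rate limiters vs.", round(u_fifo, 3), "with full sharing")
```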
Fig. 5. Rate Limiters Transition Diagram 




Fig. 6. Class-Based Fair Queueing Transition Diagram 



Class-Based Systems In class-based fair queueing without rate limiters, resource borrowing 
across classes is allowed. However, as described in Section 2, stealing occurs as 
new flows in classes with reserved rate less than $\phi_i C$ request admission. Thus, with 1-level 
probing, a class $k$ flow with bandwidth $b_k$ will be admitted if, including the probing 
flow, one of two conditions occurs: $b_k n_k \le C$ when $(b \cdot n) \le C$, or $b_k n_k \le C\phi_k$ when 
$(b \cdot n) > C$. In the state transition diagram of Figure 6, $\phi_1 = 1/3$ and $\phi_2 = 2/3$. As an 
example, consider the transition from (5, 0) to (5, 1). In the state space, the first set of 
inequalities is not satisfied because $b_2 n_2 = 2 \le C$, but $b_1 n_1 + b_2 n_2 = 7 > C$. However, the 
second set of inequalities is satisfied since $b_2 n_2 = 2 \le C\phi_2$ and $b_1 n_1 + b_2 n_2 = 7 > C$. 
Therefore this transition is possible. In contrast, the transition from (4, 1) to (5, 1) is not 
allowed because neither set of inequalities is satisfied. 
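
The two admission conditions of 1-level probing can be sketched as a small predicate and checked against the transitions of the working example. The function name is hypothetical, classes are indexed from 0, and exact fractions are used for the weights to avoid rounding effects.

```python
from fractions import Fraction


def cbq_one_level_admits(n, k, b, phi, C):
    """Does 1-level probing admit a new class-k flow from state n?"""
    m = list(n)
    m[k] += 1                       # state including the probing flow
    total = sum(bi * mi for bi, mi in zip(b, m))
    own = b[k] * m[k]
    if total <= C:
        return own <= C             # capacity is available, borrowing is allowed
    return own <= C * phi[k]        # link overloaded: only the class share counts


b, C = (1, 2), 6
phi = (Fraction(1, 3), Fraction(2, 3))
print(cbq_one_level_admits((5, 0), 1, b, phi, C))   # (5,0) -> (5,1)
print(cbq_one_level_admits((4, 1), 0, b, phi, C))   # (4,1) -> (5,1)
```

The first transition is admitted through the second condition (and creates a stealing state), while the second one fails both conditions.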

The state space has two parts. The first one is equal to that of FIFO and allows 
borrowing between classes as long as $(b \cdot n) \le C$. Thus $S_{CBQ-1}$ includes 
$\{n \in I^K : (b \cdot n) \le C\}$. Suppose we are in one of the edge FIFO states. Due to the borrowing 
between classes, for some classes, $b_k n_k < C\phi_k$, which we will call the underload classes 
(UL), and for others, $b_k n_k > C\phi_k$, which we will call the overload ones (OL). Suppose 
there is currently no stealing in the system. New probing flows from UL classes can be 
admitted until $b_{ul} n_{ul} = C\phi_{ul}$, with $ul \in UL$, irrespective of the value $(b \cdot n)$. Thus, 
departing from an edge FIFO state, new states can be created such that 
$\{n \in I^K, ul \in UL : b_{ul} n_{ul} \le C\phi_{ul}\}$. The overall state space $S_{CBQ-1}$ is then the union of the FIFO 
state space and the one constructed with these new states. 

The utilization is $u = \frac{1}{C} \sum_{n \in S_{CBQ-1}} (b \circ n)\,\pi(n)$ and the probability of stealing 
is 

$$P_{steal}^{CBQ-1} = 1 - \frac{\sum_{n \in S_{CBQ-1}} (b \circ n)\,\pi(n)}{\sum_{n \in S_{CBQ-1}} (b \cdot n)\,\pi(n)}.$$

Note that $(b \circ n)$ has the same definition as in flow-based fair queueing. Comparing 
class-based and flow-based fair queueing, observe that the class-based system has a larger 
number of stealing states than the flow-based system. For example, a transition from (6,0) 
to stealing state (6,1) is possible in the class-based system, whereas state (6,1) does not 
exist in flow-based fair queueing. The reason for this is that the flow-based fair queueing 
system blocks this 7th flow as it forces the system into loss. However, the class-based system 
admits this flow since the requested rate of 2 bandwidth units is indeed available in class 2, 
even though it forces class 1 into a stealing situation. Regardless, even though there are more 
stealing states in the class-based system, the overall stealing probability is lower (as indicated 
by numerical examples and simulations below) because the fraction of time spent in 
such stealing states is lower in the class-based system, so the bandwidth stolen is also 
lower. 

In contrast to the above 1-level probing, with ε-probing, all classes are probed to 
ensure that no stealing occurs. Here, the admissible states in this scheduler are the same 
as in FIFO, so the state space is the same, as well as the utilization. (We note that the 
utilization in the real system with nonzero probe durations is not the same however.) 



4.3 Numerical Examples 

Here we numerically solve the Markov models for each system described above. With the 
solution to the state probabilities, we compute the utilization and probability of stealing 
using the expressions derived above. We consider the scenario of previous sections 
with a link capacity of 6 bandwidth units. The weights of classes 1 and 2 are 1/3 and 
2/3, respectively. The bandwidths of class-1 and class-2 flows are 1 unit and 2 units, 
respectively. Class 1’s mean flow arrival rate is 8 requests per second while class 2’s is 
5. The mean lifetime of class 1 flows is 2/3 time units while class 2’s is 1/4 time units. 



Table 1. Utilization and Stealing Probability 

Probing Scheme        Utilization    Stealing 
ε-probing/FIFO        0.789          0 
Flow-FQ               0.792          0.140 
Rate Limiters         0.702          0 
Class-FQ (1-level)    0.789          8.28 • 



We make two observations about the numerical examples presented in Table 1. First, notice 
that ε-probing and rate limiters both have the effect of eliminating resource stealing. 
However, ε-probing does so at higher utilizations. For example, ε-probing achieves 79% 
utilization as compared to 70% under rate limiters. Moreover, the difference between 
these two utilizations is determined by the relative class demands, which in this context 
are the relative flow arrival rates. 

Second, note the stealing probabilities for flow- and class-based fair queueing (without 
ε-probing). Here, the stolen bandwidth is 0.140 for flow-based and 8.28 • 10“"^ for 
class-based. As evident from the model, CBQ incurs far less stealing than flow-based 
fair queueing. In simulation experiments, this relative difference still exists; however, 
the probability of stealing for CBQ is far greater than it is in these numerical examples. 
The reason for this is evident from the Markov model. In the CBQ system, stealing 
occurs as classes first demand bandwidths below and then later above $\phi_i C$ (as defined in 
the state space of CBQ 1-level). It is precisely such system dynamics (changing resource 
demands) which are well captured by simulations but less so by the Markov model. Thus, 
while the Markov model is useful to explore the origins and structure of multi-class resource 
stealing, we now turn to simulation experiments to quantitatively explore stealing 
under more realistic scenarios. 



5 Experimental Studies 

In this section, we present a set of simulation experiments with the goal of exploring the 
architectural design space as outlined in Section 2, evaluating ε-probing presented in 
Section 3, and validating the conclusions of the analytical model of Section 4 in a more 
general setting. 







The basic scenario is illustrated in Figure 7. It consists of a large number of hosts 
interconnected via a 45 Mb/sec multi-class router. For some experiments, the router 
contains rate limiters which drop all of a class’ packets exceeding the pre-specified rate. 
We consider several multi-class schedulers including CBQ, flow-based fair queueing, 
and rate limiters. We also consider FIFO for baseline comparisons. New flows arrive 
to the system according to a Poisson process, i.e., with independent and exponentially 
distributed inter-arrival times, and probe for a constant time of 2 seconds. Flows send 
probes at their desired admission rate except for ε-probes, which are transmitted at 
64 kb/sec. New flows are admitted if the loss rate of the probes is below the class’ threshold. 





Fig. 8. Utilization and Stealing vs. Load for Various Node Architectures: (a) Utilization, (b) Probability of Stealing 



Utilization and Stealing In the first set of experiments depicted in Figure 8, we investigate 
the challenge of simultaneously achieving high utilization and a strong service 
model without stealing. In this scenario, there are three traffic classes with bandwidth 
requirements of 512 kb/sec, 1 Mb/sec, and 2 Mb/sec respectively. We consider three 
variants of the system depicted in Figure 7. The flow-based fair queueing curve (labeled 
“FQ”) represents the case in which the scheduler allocates bandwidth fairly among flows, 
i.e., the probing flow measures no loss if its rate is less than C/N. In contrast, the 
curves labeled “Rate Limiters 1”, “Rate Limiters 2” and “CBQ 1 level probing” represent 
class-based scheduling. In the former case, each class is rate limited to C/3 so that all 
loss occurs in the rate limiters and none in the scheduler (cf. Figure 7). In the latter case, 
the classes are not rate limited and the scheduler performs CBQ with each class’ weight 
set to 1/3. In all cases, probes are transmitted at the flow’s desired rate and the ε-probing 
of Section 3 is not performed. The x-axis, labeled load, refers to the resource demand 
given by 

We make the following observations about the figure. First, comparing the results 
with rate limiters and CBQ, Figure 8(a) indicates that CBQ achieves higher utilization 
than rate limiters due to the latter’s non-work-conserving nature. That is, the rate limiters 
prevent flows from being admitted in a particular class whenever the class’ total reserved 
rate is C/3, even if capacity is available in other classes. However, from Figure 8(b), it 
is clear that the higher utilization of CBQ is achieved at a significant cost: namely, CBQ 
incurs stealing in which up to 1.5% of the bandwidth (in the range shown) guaranteed 
to flows is stolen by flows in other classes. Hence the experiments illustrate that neither 
technique simultaneously achieves high resource utilization and a strong service model. 
Moreover, as the resources demanded by a class become mismatched with the pre-allocated 
weights, the performance penalty of rate limiters is further increased. That is, 
if the demanded bandwidth were temporarily 80/10/10 rather than 33/33/33, as is the 
case for the curve labeled “Rate Limiters 2” at a load of 40, then the rate limiters would 
restrict the system utilization to at most 53%, representing a 33/10/10 allocation. 
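
The 53% bound follows from capping each class at its rate limit. A minimal sketch of this arithmetic is given below, with demands and limits expressed as percentages of the link capacity; the function name is hypothetical.

```python
def rate_limited_utilization(demand, limits):
    """Upper bound on utilization when each class is capped at its rate limit."""
    return sum(min(d, l) for d, l in zip(demand, limits))


limits = [100.0 / 3] * 3                  # each class limited to C/3
mismatched = rate_limited_utilization([80.0, 10.0, 10.0], limits)
matched = rate_limited_utilization([100.0 / 3] * 3, limits)
print(round(mismatched, 1), round(matched, 1))
```

With an 80/10/10 demand the first class is capped at one third of the link, so roughly 33 + 10 + 10 = 53% of the capacity can be used, whereas a demand matched to the limits can use the whole link.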

Second, observe the effects of flow aggregation on system performance. In particular, 
flow-based fair queueing achieves higher utilization and has higher stealing than CBQ. 
With no aggregation and flow-based queueing, smaller bandwidth flows can always steal 
bandwidth from higher bandwidth flows resulting in both higher utilization since more 
flows are admitted (in particular low bandwidth flows) as well as more flows having 
bandwidth stolen. In contrast, with class based fair queueing, stealing only occurs when 
a class exceeds its 1/3 allocation (rather than a flow exceeding its 1/N allocation) and 
a flow from another class requests admission, an event that occurs with less frequency. 



ε-Probing Figure 9(a) depicts utilization vs. load for three cases: CBQ with one-level 
probing, CBQ with ε-probing, and rate limiters. Observe that compared to one-level 
probing, ε-probing incurs a utilization penalty. There are two contributing factors. First, 
the ε-probes themselves cause an additional traffic load on the system despite their small 
bandwidth requirement. Second, by blocking flows which would result in stealing, there 
are fewer flows in the system on average with ε-probing than with one-level probing. 
Regardless, this moderate reduction in utilization has the advantage of eliminating stealing 
completely. Moreover, the utilization penalty of rate limiters can be arbitrarily high 
depending on the mismatch between the demanded resources and the established limits. 
In contrast, the performance of ε-probing does not rely on proper tuning of rate limiters, 
but rather the overhead of ε-probing simply increases linearly with the number of classes. 
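
The linear growth of the probing overhead can be made concrete with a small calculation of the traffic sent during one admission attempt. This sketch only counts probe bandwidth and is not the utilization overhead measured in the simulations. The function name is hypothetical, and the rates are the 64 kb/sec ε-probe rate and the 512 kb/sec flow rate used in the experiments.

```python
def probe_rate_kbps(flow_rate_kbps, num_classes, eps_kbps=64.0):
    """Total probing rate: the flow's own rate plus one epsilon-probe per other class."""
    return flow_rate_kbps + (num_classes - 1) * eps_kbps


one_level = probe_rate_kbps(512.0, 1)       # no other class is probed
three_classes = probe_rate_kbps(512.0, 3)
four_classes = probe_rate_kbps(512.0, 4)
print(one_level, three_classes, four_classes)
```

Each additional class adds one ε-probe, i.e., 64 kb/sec in this setting, to the probing traffic of a flow.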





Fig. 9. Utilization and Overhead of ε-Probing: (a) Utilization, (b) Overhead 



The utilization reduction solely due to probing is further illustrated in Figure 9(b). 
Observe that the overhead incurred in ε-probing is necessarily higher than that 
incurred by probing in only one class, as ε-probing must also ensure that other traffic 
classes are not in overload. However, due to the limited bandwidth required to probe 
in other classes, ε-probing incurs moderate utilization reductions typically below 2.5%. 






S. Sargento, R. Valadas, and E. Knightly 



Therefore, ε-probing is able to simultaneously eliminate stealing, provide multiple service 
levels, and enable full statistical sharing across classes. 

6 Conclusions 

Placing admission control functions at the network’s endpoints has been proposed as 
a mechanism for achieving per-flow quality-of-service in a scalable way. However, if 
routers perform class differentiation such as multiple priority queues, the system be- 
comes less observable to probing flows, precisely because of the performance isolation 
provided by the service discipline. In this paper, we have studied the resource stealing 
problem that arises in such multi-class networks and developed a simple probing scheme 
termed ε-probing which attains the high utilization of work-conserving systems while 
preventing stealing as in non-work-conserving systems with hard class-based rate limits. 
We introduced a Markov model that illustrated the design space of key network differentiation 
mechanisms, such as class- and flow-based weighted fair queueing and rate 
limiters. The model showed the different ways that stealing is manifested in the different 
configurations and provided a tool for formal comparison of diverse systems. Finally, 
our simulation experiments explored the design space under a broader set of scenarios. 
We quantified the severity of bandwidth stealing and found that ε-probing eliminates 
stealing with a modest utilization penalty required to observe the impact of a new flow 
on other traffic classes. 
Abstract. In this work we study pricing as a mechanism to control large 
networks. Our model is based on revenue maximization in a general loss 
network with Poisson arrivals and arbitrary holding time distributions. 
In dynamic pricing schemes, the network provider can charge different 
prices to the user according to the current utilization level of the net- 
work. We show that, when the system becomes large, the performance 
of an appropriately chosen static pricing scheme, whose price is indepen- 
dent of the current network utilization, will approach that of the optimal 
dynamic pricing scheme. Further, we show that under certain conditions, 
this static price is independent of the route that the connections take. 
This result has the important implication that distance-independent pric- 
ing that is prevalent in current domestic telephone networks in the U.S. 
may in fact be appropriate not only from a simplicity, but also a perfor- 
mance point of view. We also show that in large systems prior knowledge 
of the connection holding time is not important from a network revenue 
point of view. 



1 Introduction 

In the last few years there has been significant interest in using pricing as a 
mechanism to control communication networks. The network uses the current 
price of a resource as a feedback signal to coerce the users into modifying their 
actions (e.g. changing the rate or route). Price provides a good control signal 
because it carries monetary incentives. 

Past works in the literature differ in their schemes to compute the price. Some 
of the proposed schemes use different forms of auctions, e.g., “Smart Market” [?] 
or “Progressive Second Price Auction” [?], to reach the right price. Other schemes 
use differential equations at the resource and at the users to converge to such 
prices [?]. A common theme behind this category of research is to have the 
resource calculate price based on the instantaneous load and available capacity. 
If the current condition changes, a new price is calculated. The users are expected 
to act on up-to-date price information to achieve the desired design objectives. 
Feedback delays to the user may introduce undesirable effects on the performance 
of the system, such as instability or slow speed of convergence [?]. In practice, 
this feedback delay may be difficult to control because calculating up-to-date 
price information continuously consumes a large amount of processing power at 
the resource/node. Also, there is a substantial communication overhead when 
users try to obtain these up-to-date prices. 

Pricing over multiple resources also poses additional problems. Usually the 
price is calculated over a given route by summing the prices of all the 
resources along that route [?]. The user has to act on the sum. This kind 
of route-specific or distance-specific price may introduce substantial signaling 
overhead in real networks. 

Interestingly, in our daily lives, we observe quite the opposite phenomenon. 
For example, the prices of products we purchase every day do not fluctuate 
substantially; in fact, they change very slowly. Another example, this time of 
distance-neutral pricing, is the domestic long distance telephone service in the 
US. Most long distance companies offer flat rate pricing to users calling 
anywhere within the continental US. 

These observations have motivated a number of recent works. In [?], after 
a critique of the optimality-pricing approach, the authors proposed two 
approximations of congestion pricing. The first approximation was to replace 
the instantaneous congestion conditions¹ by the expected or average congestion 
conditions, similar to time-of-day pricing. The second approximation was to 
replace the cost of the actual path the flow would traverse through the network 
with the cost of the expected path, so that the charge depends only on the source 
and destination(s) of the flow and not on the particular route taken by the flow. 
However, in this work, there is no analysis of the performance implication of 
such approximations. 

In [?], the authors investigate this problem in the case of a single resource 
with Poisson arrivals and exponential service times. In particular, the authors 
study the expected utility and revenue under both dynamic pricing and static 
pricing schemes in a dynamic network. By a dynamic network, we mean that 
in their system, call arrivals and departures are taken into account. This is 
different from the work in [?], where the authors view congestion control 
as a distributed asynchronous computation to maximize the aggregate source 
utility. In these works, the optimization is done over a snapshot in time, i.e., it 
does not take into account the dynamic nature of the network. In contrast, in [?], 
the authors model the dynamic nature of the system by assuming that calls 
arrive according to a Poisson process and stay in the system for an exponentially 
distributed time. The authors study the expected utility and revenue under both 
dynamic pricing and static pricing schemes. It is shown that when the capacity 
is large, using a static pricing scheme suffices, in the sense that the utility (or 
revenue) under an appropriately chosen static price converges asymptotically 
to that of the optimal dynamic scheme. The implication of this work is that 
we can now base pricing decisions only on the average load rather than on the 
instantaneous load, thus reducing both processing and communication overhead. 

This paper extends the results in [?] in two directions: 

¹ Note that the congestion conditions in [?] can be obtained from the instantaneous 
load and available capacity in the system. Hence, these terms can be viewed 
interchangeably. 

1) We show that the same type of invariance results hold for general loss 
networks (i.e., the performance of the optimal static pricing scheme asymptotically 
approaches the performance of the dynamic scheme). The results hold even if the 
service time distribution is general. We show that the right static price depends 
on the service time distribution only through its mean. These extensions have two 
implications. Firstly, while the assumption of Poisson arrivals for calls or flows 
in the network is usually considered reasonable, the assumption of exponential 
holding time distribution is not. For example, much of the traffic generated on 
the Internet is expected to come from large file transfers, which do not conform 
to exponential modeling. By weakening the exponential service time assumption 
we can extend our results to more realistic systems. Secondly, in the general 
network case, we can show that in our revenue-maximization problem, under 
certain conditions, the static price only depends on the price elasticity of the 
user, and not on the specific route or distance. This indicates that the flat 
pricing scheme used in the domestic long distance service in the US may, in many 
cases, be good enough. 

2) Thus far, when we refer to dynamic pricing schemes, we mean schemes 
that take into account the current congestion information. We now consider an 
even broader set of dynamic pricing schemes and show that the performance of 
the static pricing scheme suffices for large systems. In this broader set of dynamic 
pricing schemes, the network has prior knowledge of the individual service time of 
the connections when they arrive. The question is whether we can gain additional 
advantage by pricing the resource based on this additional information. Our 
analysis shows that when the system grows large, this additional information 
does not result in significant gain. We show that the individual service time is 
inessential, and a static price based on the average service time suffices. 

We believe that these results are important in understanding how to use 
pricing as a mechanism for controlling real networks. The possibility of using 
static and/or route-independent prices will give rise to more efficient and realistic 
algorithms. Also, “pricing” here can be interpreted as a general signal which is 
only loosely related to actual pricing. For example, pricing models have been 
used as a mechanism to understand resource allocation and congestion control. 
Therefore the results we present here will help us in better understanding how 
to control large networks. 

1.1 Related Work 

The ideas of upper bounding the performance of the optimal dynamic policy 
in a general loss network and showing that a fixed/static policy approaches the 
upper bound asymptotically have been reported in the past; see, for example, [?] 
and the references therein. In their work, the objective to be optimized is a linear 
function of the number of users in the system, i.e., the utility value. This objective 
function maximizes the utilization of the network, but does not consider the 
revenue generated, hence there is no notion of the price of a resource. Maximizing 
the utility in a system corresponds to a linear optimization problem. However, 
revenue maximization is more difficult to evaluate because it is a non-linear 
optimization problem. Moreover, users can easily take advantage of the system 
by providing an incorrect utility value when price is not an issue. Pricing is a 
mechanism to coerce users into using the amount of resources that they need (or 
can afford!). Recently, in [?], Paschalidis and Tsitsiklis have investigated the 
problem of revenue maximization. Our work extends the result of [?] to general 
loss networks and general service time distributions. Further, our proof of the 
stationarity and ergodicity of such types of system with feedback control is new, 
and of value in its own right. Our treatment of pricing schemes based on the call 
duration sheds light on the optimal dynamic policy. 

The rest of the paper consists of three parts. In Sect. 2, we investigate the 
case of a general loss network with Poisson call arrivals and general service time 
distribution. In Sect. 3, we study the implication of individual service time on 
the price and in Sect. 4, we conclude. 



2 General Loss Network, General Service Time 

2.1 Model 

Consider an abstract network providing service to multiple classes of users. There 
are $L$ links in the network. Each link $l \in \{1, \dots, L\}$ has capacity $R^l$. There are $I$ 
classes of users. For users of each class $i$, there is a route through the network. 
The routes are characterized by a matrix $\{C_i^l, i = 1, \dots, I, l = 1, \dots, L\}$, where 
$C_i^l = 1$ if the route of class $i$ traverses link $l$, and $C_i^l = 0$ otherwise. Each class $i$ 
connection consumes bandwidth $r_i$. A call is rejected and lost if any of the links 
it traverses does not have enough capacity to accommodate it, otherwise it may 
be admitted to the network depending on the network policy. 

Calls of class $i$ arrive to the network according to a Poisson process with rate 
$\lambda_i(u_i)$, which is a function of the price $u_i$ charged to users of class $i$. Here $u_i$ 
is defined as the price per unit time of connection. We assume that $\lambda_i(u_i)$ is a 
non-increasing function of $u_i$. Therefore $\lambda_i(u_i)$ represents the price-elasticity of 
class $i$. Once admitted, the call will hold the resources on all links it traverses 
until it finishes service. The service time distribution is general with mean $1/\mu_i$. 

The network provider can charge different prices for different classes of users. 
A dynamic price is one where charges are based on the current utilization of the 
resource, for example, how many calls of each class are already in service, and 
how long they have been served, etc. On the other hand, a static price is one 
which only depends on the class of the user and is indifferent to the current 
utilization of the resource. 
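
The model above can be illustrated with a small data structure. The following is only a sketch under stated assumptions: the class name `LossNetwork`, its field names, and the two-link example are hypothetical and are not taken from the paper; only the capacity check itself follows the admission rule described in Sect. 2.1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LossNetwork:
    """Loss network with L links and I classes, one fixed route per class."""
    capacity: List[float]      # R^l for each link l
    routes: List[List[int]]    # routes[i][l] = C_i^l (1 if class i uses link l)
    bandwidth: List[float]     # r_i for each class i

    def link_load(self, n: List[int]) -> List[float]:
        """Bandwidth in use on each link when n[i] calls of class i are active."""
        return [
            sum(n[i] * self.bandwidth[i] * self.routes[i][l] for i in range(len(n)))
            for l in range(len(self.capacity))
        ]

    def can_admit(self, n: List[int], i: int) -> bool:
        """True if one more class-i call fits on every link of its route."""
        load = self.link_load(n)
        return all(
            load[l] + self.bandwidth[i] <= self.capacity[l]
            for l in range(len(self.capacity))
            if self.routes[i][l] == 1
        )


# Hypothetical two-link example (not the topology of the paper's Fig. 1).
net = LossNetwork(capacity=[10.0, 5.0],
                  routes=[[1, 1], [1, 0]],
                  bandwidth=[2.0, 1.0])
print(net.link_load([2, 3]))     # load on each link in state n = (2, 3)
print(net.can_admit([2, 3], 0))  # class 0 needs 2 more units on both links
print(net.can_admit([2, 3], 1))  # class 1 needs 1 more unit on the first link
```

A call that passes `can_admit` is only eligible for admission; whether it actually enters, and at what price, is left to the pricing policy.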



2.2 Stability of the Model 

Our first result concerns the stability of such a system. Note that our model 
is very similar to a network of M/G/N/N queues. However, here we also have 
“feedback” introduced by the price $u_i$, which makes the model somewhat more 
complex. The arrival rate changes over time, so it is not even obvious how to define 
stationarity and ergodicity for the input. However, as we will see next, the system 
will be stationary and ergodic under very general conditions. 

Proposition 1. Assume that the arrival rates $\lambda_i(u)$ are bounded above by some 
constant $\lambda_0$ for all classes $i$, and that the service times are i.i.d. with finite mean 
and independent of the arrivals. If the price is only dependent on the current state 
of the system, then any stochastic process that is only a function of the system 
state is asymptotically stationary and the stationary version is ergodic. 

Proof. To show this, let us first look at the case of a single resource with users 
coming from a single class $i$. To develop the result, we need to take a different but 
equivalent view of the original model. In the original system, the arrival rate is a 
function of the current price. In the new but equivalent system, the arrival rate is 
constant, but arrivals are thinned with a probability that is a function of the price. 
Specifically, in the new model, the arrivals are Poisson with constant rate $\lambda_0$. Each 
arrival now carries a value $v$ that is independently distributed, with distribution 
function $P\{v \ge u\} = \lambda_i(u)/\lambda_0$. The value $v$ of each arrival is independent of the 
arrival process and service times. If $v < u$, where $u$ is the current price at the time of 
arrival, the call will not enter the system. We can see that with this construction, 
at each time instant, the arrivals are bifurcated with probability of success equal 
to $P\{v \ge u\} = \lambda_i(u)/\lambda_0$. Therefore the resultant arrivals (after thinning) in the 
new model are also Poisson with rate $\lambda_0 P\{v \ge u\} = \lambda_i(u)$. Thus the model is 
equivalent to the original model. 
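
The thinning construction can be checked numerically in the simplest setting. The sketch below assumes a fixed price (so the acceptance probability is constant) and a linear demand curve; the function names, the seed, and all parameter values are illustrative assumptions rather than part of the paper. In the proof itself the price, and hence the acceptance probability, may change with the system state.

```python
import random


def thinned_arrivals(lam0, accept_prob, horizon, rng):
    """Count accepted arrivals of a rate-lam0 Poisson stream thinned with accept_prob."""
    t, accepted = 0.0, 0
    while True:
        t += rng.expovariate(lam0)      # time of the next potential arrival
        if t > horizon:
            return accepted
        if rng.random() < accept_prob:  # event {v >= u}: the call enters
            accepted += 1


def demand(u, lam_max=1.0, u_max=10.0):
    """Assumed linear demand curve, bounded above by lam0 = lam_max."""
    return max(0.0, lam_max * (1.0 - u / u_max))


rng = random.Random(0)
lam0, u, horizon = 1.0, 4.0, 200_000.0
count = thinned_arrivals(lam0, demand(u) / lam0, horizon, rng)
empirical_rate = count / horizon
print(empirical_rate, demand(u))  # the two rates should be close
```

With a constant acceptance probability, the accepted stream has an empirical rate close to the target rate at that price, which is the equivalence the proof relies on.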

To show stationarity and ergodicity, we need to construct a so-called 
“regenerative event” [?]. Let $\{\tau_n^e, \tau_n^s, v_n\}$, $-\infty < n < \infty$, be the $n$-th arrival's 
interarrival time, service time, and value, respectively (note this is the arrival of 
the Poisson process before bifurcation). Let 

$$Q_n = 1\{\tau_{n-1}^s > \tau_{n-1}^e\} + 1\{\tau_{n-2}^s > \tau_{n-2}^e + \tau_{n-1}^e\} + \dots + 1\Big\{\tau_{n-k}^s > \sum_{j=0}^{k-1} \tau_{n-k+j}^e\Big\} + \dots$$

Also let $A_n = \{Q_n = 0\}$. Then $A_n$ can be interpreted as the event that “all 
potential arrivals before time $n$ have left the system.” The event $A_n$ is 
regenerative, that is, if event $A_n$ occurs, then after time $n$, the system will evolve 
independently from the past (this is true because we assume that the price is 
only dependent on the current state of the network). The events $A_n$ are 
stationary, i.e., if we define $T$ as the shift operator, 
$T\{(\tau_{n_i}^e, \tau_{n_i}^s, v_{n_i}) \in B_i, i = 1, \dots, k\} = \{(\tau_{n_i+1}^e, \tau_{n_i+1}^s, v_{n_i+1}) \in B_i, i = 1, \dots, k\}$, 
then $A_n = T^n A_0$, and $P\{A_n\} = P\{A_0\}$. 
Note here we have eliminated the dependence on both $v_n$ and the price $u$. Now 
to proceed with the proof, we need the following lemma. 

Lemma 1. Let the sequence of service times $\tau_n^s$ be i.i.d., and $E[\tau_n^s] < \infty$. Then 
$P\{A_0\} > 0$. 

Proof. See Appendix. 

Now by Borovkov’s Ergodic Theorem [?], the distribution of the state of 
the system converges as $n \to \infty$ to the distribution of the stationary process. 
Ergodicity follows from the lemma below. 

Lemma 2. The regenerative event $A_n$ is positive recurrent, i.e., let 
$T_1 = \inf\{n \ge 1 : X_n \in A_n\}$. Then $E\{T_1 \mid X_0 \in A_0\} < \infty$, where $X_n$ is the state of the 
system at time $n$. 

Proof. See Appendix. 

Since the regenerative event is positive recurrent, the state of the system is 
both stationary and ergodic, i.e., any random process that depends only on the 
state of the system is both stationary and ergodic [?]. 

For the case of multiple classes and multiple links, assume there are $I$ classes. 
We can construct the equivalent system in the following way: we first construct 
Poisson arrivals with rate $I\lambda_0$. Each of these arrivals is assigned to class $i$ with 
probability $1/I$, and each of these assignments is independent of each other. The 
service time is then generated according to the service time distribution of class 
$i$. Each class $i$ arrival carries a value $v$ that is independently distributed, with 
distribution function $P_i\{v \ge u_i\} = \lambda_i(u_i)/\lambda_0$. The value $v$ of each arrival is 
independent of the arrival process and service times. If $v < u_i$, where $u_i$ is the 
current price for class $i$ at the time of the arrival, the call will not enter the 
system. Following the same idea as in the first paragraph of the proof, it is easy 
to show that such a constructed system is equivalent to the original system. 

The initial Poisson arrivals with rate $I\lambda_0$ can be interpreted as “all potential 
arrivals from all classes.” Let $\{\tau_n^e, \tau_n^s\}$ be the $n$-th arrival’s interarrival time and 
service time respectively. It then follows that the sequence of service times $\tau_n^s$ 
is again i.i.d. with finite mean, and it is independent of the arrivals. Hence, we 
can construct the event $A_n$ as before, which is now the event that “all potential 
arrivals from all classes have left the system before time $n$.” Again this event is 
the “regenerative event” for the system, and we can show that $P\{A_0\} > 0$, and 
$A_n$ is positive recurrent. Therefore, the system is asymptotically stationary and 
the stationary version is ergodic. □ 

2.3 Dynamic Price, Static Price, and Upper Bound 

In this section, we consider dynamic pricing schemes that are based on the 
current occupancy of the network resources. Let $n = \{n_i, i = 1, \dots, I\}$ represent 
the state of the network, where $n_i$ is the number of users from class $i$ that 
are being served in the network. Let $\Omega$ denote the set of possible states; then 
$\Omega = \{n : \sum_i n_i r_i C_i^l \le R^l \text{ for all } l\}$. Let $n(t)$ denote the state at time $t$; then 
the dynamic price for class $i$ can be written as $u_i(t) = g_i(n(t))$, where $g_i$ is a 
function from $\Omega$ to the set of real numbers $\mathbb{R}$. Let $g = \{g_i, i = 1, \dots, I\}$. 

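The expected revenue achieved by any dynamic pricing scheme is given by 

$$\lim_{T \to \infty} E\left[\frac{1}{T} \sum_{i=1}^{I} \int_0^T \lambda_i(u_i(t))\, u_i(t)\, \frac{1}{\mu_i}\, dt\right].$$

From stationarity and ergodicity established in the last section, we have 

$$\lim_{T \to \infty} E\left[\frac{1}{T} \sum_{i=1}^{I} \int_0^T \lambda_i(u_i(t))\, u_i(t)\, \frac{1}{\mu_i}\, dt\right] = E\left[\sum_{i=1}^{I} \lambda_i(u_i(t))\, u_i(t)\, \frac{1}{\mu_i}\right],$$

where the right hand side is independent of $t$ (because of stationarity). 
Therefore, the optimal dynamic policy is 

$$J^* = \max_g E\left[\sum_{i=1}^{I} \lambda_i(u_i(t))\, u_i(t)\, \frac{1}{\mu_i}\right].$$

On the other hand, for static pricing schemes, the expected revenue per unit 
time is: 

$$J_0 = \sum_{i=1}^{I} \lambda_i(u_i)\, u_i\, \frac{1}{\mu_i}\, (1 - P_{loss,i}[u]),$$

where $P_{loss,i}[u]$ is the probability of loss for class $i$ when the price vector is 
$u = [u_1, \dots, u_I]$. Therefore the optimal static policy is 

$$J_s = \max_u \sum_{i=1}^{I} \lambda_i(u_i)\, u_i\, \frac{1}{\mu_i}\, (1 - P_{loss,i}[u]).$$

By definition $J_s \le J^*$. 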

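Here we show that the upper bound in [?] is also an upper bound for our 
case. For convenience, we write $u_i$ as a function of $\lambda_i$. Let $F_i(\lambda_i) = \lambda_i u_i(\lambda_i)$. 
Let $J_{ub}$ be the optimal value of the following nonlinear programming problem: 

$$\max_{\lambda_i} \sum_i F_i(\lambda_i) \frac{1}{\mu_i} \qquad (1)$$

subject to 

$$\lambda_i = \mu_i n_i \quad \text{for all } i,$$

$$\sum_i n_i r_i C_i^l \le R^l \quad \text{for all } l. \qquad (2)$$
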

Proposition 2. If the functions $F_i$ are concave, then $J^* \le J_{ub}$. 

Proof. We follow the same method as in the proof of Theorem 6 in [?]. Consider 
an optimal dynamic pricing policy. Let $n_i(t)$ be the number of calls of class $i$ in 
the system at time $t$. View $\lambda_i(t)$ and $n_i(t)$ as random variables. From Little’s 
Law, we have 

$$E[n_i(t)] = E[\lambda_i(t)] \frac{1}{\mu_i}.$$

At any time $t$, $\sum_i n_i(t) r_i C_i^l \le R^l$ for all $l$. Therefore 

$$\sum_i E[n_i(t) r_i C_i^l] \le E\Big[\sum_i n_i(t) r_i C_i^l\Big] \le R^l \quad \text{for all } l.$$

Therefore $\lambda_i = E[\lambda_i(t)]$ and $n_i = E[n_i(t)]$ satisfy the constraints of $J_{ub}$. 
Using the concavity of $F_i$ and Jensen’s inequality, we have 

$$J_{ub} \ge \sum_i F_i(E[\lambda_i(t)]) \frac{1}{\mu_i} \ge E\Big[\sum_i F_i(\lambda_i(t)) \frac{1}{\mu_i}\Big] = J^*. \qquad (3)$$

□ 

2.4 The Many Sources Regime 

Consider the regime of many small users. Let $c \ge 1$ be a scaling factor. We 
consider a series of systems scaled by $c$. The scaled system has capacity $R^c = cR$, 
and the arrivals of each class $i$ have rate $\lambda_i^c(u) = c\lambda_i(u)$. Let $J^{*,c}$, $J_s^c$ and $J_{ub}^c$ 
be the dynamic revenue, static revenue, and upper bound, respectively, for the 
$c$-scaled system. 

Proposition 3. If the functions $F_i$ are concave, then 

$$\lim_{c \to \infty} \frac{1}{c} J_s^c = \lim_{c \to \infty} \frac{1}{c} J^{*,c} = \lim_{c \to \infty} \frac{1}{c} J_{ub}^c.$$

Proof. Firstly, $J_{ub}^c$ is obtained by maximizing $\sum_i c\lambda_i u_i(\lambda_i)/\mu_i$, subject to the 
constraint $\sum_i c\lambda_i r_i C_i^l/\mu_i \le cR^l$ for all $l$. Therefore the optimal price is 
independent of $c$, and $J_{ub}^c = cJ_{ub}$. 

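Now consider $J_s^c$. For every static price falling into the constraint of $J_{ub}$, i.e., 

$$\sum_i \frac{\lambda_i(u_i)}{\mu_i} r_i C_i^l \le R^l \quad \text{for all } l, \qquad (4)$$

let $J_0^c$ denote the revenue under this static price. We will show that as $c \to \infty$, 

$$\lim_{c \to \infty} \frac{J_0^c}{c} = \sum_i \lambda_i(u_i)\, u_i\, \frac{1}{\mu_i}. \qquad (5)$$

If we take the optimal price of the upper bound as our static price, then the 
right hand side of (5) is exactly the upper bound. Therefore, 

$$\lim_{c \to \infty} \frac{J_s^c}{c} \ge \lim_{c \to \infty} \frac{J_0^c}{c} \ge J_{ub}.$$

On the other hand, $J_s^c \le J^{*,c} \le cJ_{ub}$, and the result follows. 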

Now we show (5). The key idea is to use an insensitivity result from [2]. In 
[2], Burman et al. investigate a blocking network model, where a call 
instantaneously seizes channels along a route between the originating and terminating 
node, holds the channels for a randomly distributed length of time, and frees 
them instantaneously at the end of the call. If no channels are available, the call 
is blocked. When the arrivals are Poisson and the holding time distributions are 
general, the authors in [2] show that the blocking probabilities are still in 
product form, and are insensitive to the call holding-time distributions. This means 
that they depend on the call duration only through its mean. 

Our system is a special case of [2]. Let $n = \{n_j, j = 1, \dots, I\}$ be the vector 
denoting the state of the system, and let $\lambda_j$ be the arrival rate of class $j$ under 
the static price $u_j$, i.e., $\lambda_j = c\lambda_j(u_j)$. Let $\rho_j = \lambda_j/\mu_j$. From [2], we have the 
blocking probability of calls of class $i$ as: 

$$P_{loss,i}^c = \frac{\sum_{n \in \Gamma'} \prod_j \rho_j^{n_j}/n_j!}{\sum_{n \in \Gamma_0} \prod_j \rho_j^{n_j}/n_j!}, \qquad (6)$$

where 

$$\Gamma' = \Big\{n : \sum_j n_j r_j C_j^l \le cR^l \text{ for all } l, \text{ and there exists an } l \text{ such that } C_i^l = 1 \text{ and } \sum_j n_j r_j C_j^l > cR^l - r_i\Big\},$$

$$\Gamma_0 = \Big\{n : \sum_j n_j r_j C_j^l \le cR^l \text{ for all } l\Big\}.$$

Since from (6) we can see that the blocking probability is exactly the same 
as in the case of exponential service times, we now only need to look at the case 
of exponential service times. 
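
For intuition, the product form in (6) can be evaluated by brute-force enumeration on a very small system. The sketch below is restricted to a single link with integer bandwidths, which is an assumption made here for brevity; the function name `blocking_probability` and the example loads are hypothetical.

```python
from itertools import product
from math import factorial


def blocking_probability(capacity, rho, r, i):
    """Blocking probability of class i on one link, by enumerating the product form.

    capacity: integer link capacity; rho[j]: offered load lambda_j / mu_j;
    r[j]: integer bandwidth of a class-j call.
    """
    ranges = [range(capacity // r[j] + 1) for j in range(len(rho))]
    num = den = 0.0
    for n in product(*ranges):
        used = sum(n[j] * r[j] for j in range(len(rho)))
        if used > capacity:
            continue                  # state is not feasible
        weight = 1.0
        for j in range(len(rho)):
            weight *= rho[j] ** n[j] / factorial(n[j])
        den += weight                 # every feasible state
        if used > capacity - r[i]:
            num += weight             # states in which a class-i call is blocked
    return num / den


# One class with unit bandwidth reduces to the classical Erlang B formula.
print(blocking_probability(5, [3.0], [1], 0))
# Two classes: the wider calls (class 0) are blocked more often.
print(blocking_probability(10, [2.0, 3.0], [2, 1], 0),
      blocking_probability(10, [2.0, 3.0], [2, 1], 1))
```

Only the loads $\rho_j$ enter the computation, which is the insensitivity property: the holding-time distribution matters only through its mean.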

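Consider an infinite channel system with the same arrival rate and 
holding-time distribution. Let $n_{j,\infty}$ be the number of flows of class $j$ in the infinite 
channel system. Further let $n_\infty = \{n_{j,\infty}, j = 1, \dots, I\}$ be the vector denoting the 
state of the infinite channel system. We can then rewrite (6) as 

$$P_{loss,i}^c = P_{\Gamma'}^{c,\infty} / P_{\Gamma_0}^{c,\infty},$$

where 

$$P_{\Gamma_0}^{c,\infty} = \sum_{n_\infty \in \Gamma_0} \prod_j e^{-\rho_j} \frac{\rho_j^{n_{j,\infty}}}{n_{j,\infty}!}$$

and 

$$P_{\Gamma'}^{c,\infty} = \sum_{n_\infty \in \Gamma'} \prod_j e^{-\rho_j} \frac{\rho_j^{n_{j,\infty}}}{n_{j,\infty}!}$$

are the probabilities that $\{n_\infty \in \Gamma_0\}$ and $\{n_\infty \in \Gamma'\}$, respectively, in the infinite 
channel system. 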

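We will use the estimates of $P_{\Gamma_0}^{c,\infty}$ and $P_{\Gamma'}^{c,\infty}$ to bound $P_{loss,i}^c$. In the infinite 
channel system, there is no capacity constraint. Therefore the number of flows $n_{j,\infty}$ in 
class $j$ is Poisson (from the well-known M/M/∞ result) and independent of the 
number of flows in other classes. We can view each $n_{j,\infty}$ as a sum of $c$ independent 
random variables. 

First we calculate the first and second order statistics of $n_{j,\infty}$: 

$$E[n_{j,\infty}] = c\frac{\lambda_j(u_j)}{\mu_j}, \qquad \sigma^2[n_{j,\infty}] = c\frac{\lambda_j(u_j)}{\mu_j}.$$

Now by invoking the Central Limit Theorem, as $c \to \infty$, we have 

$$\frac{n_{j,\infty} - c\frac{\lambda_j(u_j)}{\mu_j}}{\sqrt{c}} \to N\Big(0, \frac{\lambda_j(u_j)}{\mu_j}\Big) \quad \text{in distribution}. \qquad (7)$$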
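
A quick numerical check of the consequence of (7) used later in the proof: for a Poisson variable with a large mean, the probability of not exceeding the mean approaches 0.5. This is only an illustrative sketch; the base load of 10 and the helper name are assumptions.

```python
from math import exp, lgamma, log


def poisson_cdf_at_mean(mean):
    """P{N <= mean} for N ~ Poisson(mean) with integer mean, summed term by term.

    Each probability mass is computed in log space to avoid overflow.
    """
    total = 0.0
    for k in range(mean + 1):
        total += exp(k * log(mean) - mean - lgamma(k + 1))
    return total


for c in (1, 10, 100, 1000):
    mean = 10 * c  # c * lambda_j / mu_j with an assumed base load of 10
    print(c, poisson_cdf_at_mean(mean))
```

The printed probabilities stay above 0.5 and move toward it as $c$ grows, which is what the normal limit predicts.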



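Let $x^{c,l} = \sum_j n_{j,\infty} r_j C_j^l$ be defined as the amount of resource consumed at 
link $l$ in the infinite channel system. By (7) and the independence of the $n_{j,\infty}$ 
across classes, we have 

$$\frac{x^{c,l} - c\sum_j \frac{\lambda_j(u_j)}{\mu_j} r_j C_j^l}{\sqrt{c}} \to N(0, Q) \quad \text{in distribution}, \qquad (8)$$

where $Q = \sum_j \frac{\lambda_j(u_j)}{\mu_j} r_j^2 C_j^l$. 

Now since $c\sum_j \frac{\lambda_j(u_j)}{\mu_j} r_j C_j^l \le cR^l$ for all $l$, 

$$\liminf_{c \to \infty} P_{\Gamma_0}^{c,\infty} = \liminf_{c \to \infty} P\Big\{\frac{x^{c,l}}{c} \le R^l, \text{ for all } l\Big\}$$

$$\ge \liminf_{c \to \infty} P\Big\{\frac{x^{c,l}}{c} \le \sum_j \frac{\lambda_j(u_j)}{\mu_j} r_j C_j^l, \text{ for all } l\Big\} \quad \text{(by (4))}$$

$$\ge \liminf_{c \to \infty} P\Big\{n_{j,\infty} \le c\frac{\lambda_j(u_j)}{\mu_j}, \text{ for all } j\Big\} \quad \text{(by definition of } x^{c,l})$$

$$= \liminf_{c \to \infty} \prod_j P\Big\{n_{j,\infty} \le c\frac{\lambda_j(u_j)}{\mu_j}\Big\} \ge 0.5^I \quad \text{(by (7))},$$

and 

$$\limsup_{c \to \infty} P_{\Gamma'}^{c,\infty} = \limsup_{c \to \infty} P\Big\{\frac{x^{c,l}}{c} \le R^l \text{ for all } l, \text{ and there exists } l \text{ such that } C_i^l = 1 \text{ and } \frac{x^{c,l}}{c} > R^l - \frac{r_i}{c}\Big\}$$

$$\le \limsup_{c \to \infty} \sum_l P\Big\{\frac{x^{c,m}}{c} \le R^m \text{ for all } m, \text{ and } \frac{x^{c,l}}{c} > R^l - \frac{r_i}{c}\Big\}$$

$$\le \limsup_{c \to \infty} \sum_l P\Big\{R^l - \frac{r_i}{c} < \frac{x^{c,l}}{c} \le R^l\Big\}$$

$$= \sum_l 0 = 0 \quad \text{(by (8))}. \qquad (9)$$

Therefore 

$$\lim_{c \to \infty} P_{loss,i}^c = \lim_{c \to \infty} P_{\Gamma'}^{c,\infty} / P_{\Gamma_0}^{c,\infty} = 0,$$

and 

$$\lim_{c \to \infty} \frac{J_0^c}{c} = \lim_{c \to \infty} \frac{1}{c} \sum_i c\lambda_i(u_i)\, u_i\, \frac{1}{\mu_i}\, (1 - P_{loss,i}^c) = \sum_i \lambda_i(u_i)\, u_i\, \frac{1}{\mu_i}.$$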



Thus the result follows. 

The above proof not only shows that the limit converges, but also shows that 
the speed of convergence is at least $1/\sqrt{c}$. To see this, we go back to (9). The 
convergence is the slowest when (4) is satisfied with equality, in which case 

$$\sum_l P\Big\{R^l - \frac{r_i}{c} < \frac{x^{c,l}}{c} \le R^l\Big\} \approx \sum_l \frac{1}{\sqrt{2\pi Q}} \frac{1}{\sqrt{c}}\, r_i.$$

In fact, we can show that if the maximizing price of the upper bound falls 
inside the constraint (4) (instead of at the border), then the speed of convergence 
is exponential. 
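
The two convergence regimes can be observed in the simplest special case, a single link with a single class of unit-bandwidth calls, where the blocking probability is given by the classical Erlang B formula. The sketch below is an illustration under that assumption; the base capacity of 10 channels and the 80% load level are hypothetical choices.

```python
from math import sqrt


def erlang_b(servers, load):
    """Erlang B blocking probability via the standard numerically stable recursion."""
    b = 1.0
    for n in range(1, servers + 1):
        b = load * b / (n + load * b)
    return b


base_capacity = 10  # assumed base system: R = 10 unit-bandwidth channels
for c in (1, 10, 100, 1000):
    n = c * base_capacity
    critical = erlang_b(n, float(n))  # load equals capacity: constraint tight
    light = erlang_b(n, 0.8 * n)      # load strictly inside the constraint
    print(c, critical, critical * sqrt(c), light)
```

When the constraint is tight, the blocking probability multiplied by $\sqrt{c}$ settles to a roughly constant value, consistent with a $1/\sqrt{c}$ rate. When the load is strictly below capacity, the blocking probability drops far faster as $c$ grows.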

Here we report a few numerical results. Consider the network in Fig. 1. There 
are 4 classes of flows. Their routes are shown in the figure. Their arrivals are 
Poisson. The function $\lambda_i(u)$ for each class $i$ is of the form 

$$\lambda_i(u) = \left[\lambda_{max,i}\left(1 - \frac{u}{u_{max,i}}\right)\right]^+,$$

i.e., $\lambda_i(0) = \lambda_{max,i}$ and $\lambda_i(u_{max,i}) = 0$ for some constants $\lambda_{max,i}$ and $u_{max,i}$. 
The price elasticity is then 

$$-\frac{\lambda_i'(u)}{\lambda_i(u)} = \frac{1/u_{max,i}}{1 - u/u_{max,i}} \quad \text{for } 0 < u < u_{max,i}.$$

The mean holding time is $1/\mu_i$. The arrival rates, price elasticity, service rates 
$\mu_i$, and bandwidth requirements are shown in Table 1. 




Fig. 1. The network topology 



First, let us consider a base system where the 5 links have capacity 10, 10, 5, 
15, and 15 respectively. The solution of the upper bound (1) is shown in Table 2. 
The upper bound is $J_{ub} = 127.5$. We then use simulations to verify how tight this 
upper bound is and how close the performance of the static pricing policy can 
approach this upper bound when the system is large. We use the price induced 
by the upper bound calculated above as our static price. We first simulate the 
case when the holding time distributions are exponential. We simulate $c$-scaled 
versions of the base network where $c$ ranges from 1 to 1000. For each scaled 
system, we simulate the static pricing scheme, and report the revenue generated. 
In Fig. 2 we show the normalized revenue $J_0/c$ as a function of $c$. 

Table 1. Traffic and price parameters of 4 classes 

                        Class 1    Class 2    Class 3    Class 4 
  $\lambda_{max,i}$     0.01       0.01       0.02       0.01 
  $u_{max,i}$           10         10         20         20 
  Service Rate          0.002      0.001      0.002      0.001 
  Bandwidth             2          1          1          2 

Table 2. Upper bound in the constrained case, $J_{ub} = 127.5$ 

                        Class 1    Class 2    Class 3    Class 4 
  $u_i$                 9.00       5.00       12.00      10.00 
  $\lambda_i$           0.00100    0.00500    0.00800    0.00500 
  $\lambda_i/\mu_i$     0.500      5.00       4.00       5.00 




[Figure: normalized revenue $J_0/c$ plotted against the scaling factor $c$.] 

Fig. 2. The static pricing policy compared with the upper bound: the constrained case. 
The dotted line is the upper bound. 



As we can see, as the system grows large, the performance gap between the 
static pricing scheme and the upper bound gets smaller and smaller. Although 
we do not know what the optimal dynamic policy is, its normalized revenue 
J*/c must lie somewhere between that of the static policy and the upper bound. 
Therefore the performance gap between the static pricing scheme and the 
optimal dynamic scheme also gets very small. For example, when $c = 10$, which 
corresponds to the case when the link capacity can accommodate around 100 
flows, the performance gap between the static policy and the upper bound is less 
than 7%. Further, the gap decreases as $1/\sqrt{c}$. 

X. Lin and N.B. Shroff 

Now, we change the capacity of link 3 from 5 to 15. The solution of the upper 
bound is shown in Table 3. 

Table 3. Upper bound in the unconstrained case, $J_{ub} = 137.5$ 

                        Class 1    Class 2    Class 3    Class 4 
  $u_i$                 5.00       5.00       10.00      10.00 
  $\lambda_i$           0.00500    0.00500    0.0100     0.00500 
  $\lambda_i/\mu_i$     2.50       5.00       5.00       5.00 

Fig. 3. The static pricing policy compared with the upper bound: the unconstrained 
case. The dotted line is the upper bound. 

The upper bound is $J_{ub} = 137.5$. The simulation result (Fig. 3) confirms again 
that the performance of the static policy approaches the upper bound when the 
system is large. At $c = 10$, the performance gap between the static policy and the 
upper bound is less than 10%. Also note that when the system is in an 
unconstrained state, the price in our static scheme is the same for users with the same 
price-elasticity even if they traverse different routes. For example, classes 1 & 2 
and classes 3 & 4 have the same price (and price-elasticity) but have different 
routes. In general, if there is no significant constraint of resources, the 
maximizing price structure will be independent of the route of the connection. To see 
this, we go back to the formulation of the upper bound (1). If the unconstrained 
maximizer of $F_i(\lambda_i)$ satisfies the constraint, then it is also the maximizer of 
the constrained problem. In this case the price only depends on the function 
$u_i(\lambda_i)$, which represents the price elasticity of the users. Readers can verify that 
in our second example, when the capacity of link 3 is 15, if we lift the constraints 
in (2) and solve the upper bound again, we will get the same result. Therefore 
in our example, the optimal price will only depend on the price elasticity of each 
class and not on the specific route. Since class 1 has the same price elasticity as 
class 2, its price is also the same as that of class 2, even though it traverses a 
longer route through the network. That also explains why in some scenarios, for 
example, state-to-state long distance in the US, prices are flat. 
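
The verification suggested to the reader can be sketched in a few lines. The parameters below are transcribed from Table 1, and the linear demand curve is the form consistent with the numbers in Tables 2 and 3. The routes of Fig. 1 are not used, so the constrained case is only evaluated at the prices reported in Table 2 rather than re-optimized. Function and variable names are illustrative.

```python
# Parameters transcribed from Table 1.
lam_max = [0.01, 0.01, 0.02, 0.01]
u_max = [10.0, 10.0, 20.0, 20.0]
mu = [0.002, 0.001, 0.002, 0.001]


def demand(i, u):
    """Arrival rate of class i at price u: lam_max * (1 - u / u_max), floored at 0."""
    return max(0.0, lam_max[i] * (1.0 - u / u_max[i]))


def revenue(prices):
    """Objective of the upper bound: sum_i lambda_i(u_i) * u_i / mu_i."""
    return sum(demand(i, u) * u / mu[i] for i, u in enumerate(prices))


# With linear demand, u * lam_max * (1 - u / u_max) is maximized at u = u_max / 2.
unconstrained_prices = [um / 2.0 for um in u_max]
print(unconstrained_prices, revenue(unconstrained_prices))

# Prices reported in Table 2 for the constrained case (link 3 has capacity 5).
constrained_prices = [9.0, 5.0, 12.0, 10.0]
print(revenue(constrained_prices))
```

The unconstrained prices depend only on $u_{max,i}$, that is, only on each class's demand curve, which is why classes with the same elasticity receive the same price regardless of route.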

We also simulate the case when the holding time distribution is deterministic. 
The result is the same as that of the exponential holding time distribution. The case 
with a heavy-tailed holding time distribution is more complicated. Here is a result 
when the holding time distribution is Pareto, i.e., the cumulative distribution 
function is $1 - 1/x^{\alpha}$, with $\alpha = 1.5$. This distribution has finite mean but infinite 
variance. We use the same set of parameters as the constrained case above, 
and let the Pareto distribution have the same mean as that of the exponential 
distribution. 
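
One way to generate such holding times is inverse-transform sampling from a Pareto distribution whose scale is chosen to match a target mean. This is a sketch, not the authors' simulator: the scale parameter `x_m`, the helper names, and the seed are assumptions, and the mean of 500 corresponds to $1/\mu_1$ for class 1 in Table 1.

```python
import random


def pareto_scale_for_mean(mean, alpha):
    """Scale x_m such that a Pareto(alpha) variable on [x_m, inf) has the given mean."""
    if alpha <= 1.0:
        raise ValueError("the mean is finite only for alpha > 1")
    return mean * (alpha - 1.0) / alpha


def pareto_cdf(x, x_m, alpha):
    """CDF 1 - (x_m / x)^alpha for x >= x_m, and 0 below x_m."""
    return 0.0 if x < x_m else 1.0 - (x_m / x) ** alpha


def pareto_sample(rng, x_m, alpha):
    """Inverse-transform sample: solve u = CDF(x) for x."""
    u = rng.random()
    return x_m * (1.0 - u) ** (-1.0 / alpha)


alpha = 1.5
mean_holding = 1.0 / 0.002  # class 1 of Table 1: 1 / mu_1 = 500
x_m = pareto_scale_for_mean(mean_holding, alpha)
rng = random.Random(1)
samples = [pareto_sample(rng, x_m, alpha) for _ in range(100_000)]
print(x_m, sum(samples) / len(samples))
```

Because the variance is infinite for $\alpha = 1.5$, the sample mean can remain noticeably away from 500 even with many samples, which illustrates why simulations with such holding times converge slowly.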

In Fig. 4 we can see that even with the Pareto distribution the performance 
of the static policy still follows the same trend as the case of exponential holding 
time distribution. It demonstrates that our result is indeed insensitive to the 
holding time distribution. However, we also note that if the holding time distribution 
is Pareto with infinite variance, the sample path convergence becomes very slow. 
The problem is even worse when the system is large. We will briefly discuss this 
in the conclusion. 




Fig. 4. The static pricing policy compared with the upper bound: the constrained case 
with Pareto distribution. The dotted line is the upper bound; ‘x’ and ‘+’ mark the 0.99 
confidence interval. 

3 Pricing on the Incoming Traffic Parameters 

Our previous result shows that dynamic pricing based on the instantaneous 
utilization does not provide us with significant improvement over static pricing, 
when the system is large. The question to then ask is: If we base our dynamic 
price on some other factor, can we outperform the static pricing scheme? In this 
section, we investigate pricing based on another factor, namely the duration of a 
connection. Thus, here we look at the case when the dynamic pricing scheme, in 
addition to the instantaneous load, also has prior knowledge of the call duration. 

For convenience we restrict ourselves to the single resource and single class 
model. The case of multiple resources and multiple classes can be treated 
analogously. Assume flows consume unit bandwidth. The only difference from the 
system studied in Sect. 2 is that now the provider tries to price the incoming call 
according to additional information, i.e., the duration of the call (here we 
assume that each connection request knows how long it has to last). Therefore the 
price becomes $u(t,T) = g(n(t),T)$, where $T$ reflects the length of the incoming 
call and $g$ is a function from $\Omega \times \mathbb{R}$ to $\mathbb{R}$. 

The question again is: what is the relationship between such a dynamic pric- 
ing scheme and a static pricing scheme? The static pricing scheme can now either 
take T into consideration (i.e., price is u{T)), or not (i.e., price is u, a constant). 
In this section we will show that the performance of the static pricing scheme 
using price independent of T will approach that of the optimal dynamic pricing 
scheme when the system is large. Therefore there is really no incentive to price 
according to individual holding time in a large system. 

We consider the special case when the range of service time can be partitioned 
into slots $[T_k, T_k + \Delta T_k)$, $k = 1, 2, \ldots$, and at each time instant the price is constant 
within each slot, i.e., $u(t,T) = u_k(t)$ and $u(T) = u_k$ for $T \in [T_k, T_k + \Delta T_k)$. 

Let $\bar{T}_k = E\{T \mid T \in [T_k, T_k + \Delta T_k)\}$, and let $p_k = P\{T \in [T_k, T_k + \Delta T_k)\}$. 
Our idea is to decompose the original arrivals into a spectrum of substreams. 
Substream $k$ has service time in $[T_k, T_k + \Delta T_k)$. Its arrival rate is thus $\lambda(u) p_k$. 
Here, we assume that the price-elasticity of calls is independent of $T$. Therefore 
the arrival of each substream $k$ is still Poisson. The expected dynamic revenue 
is then given by 

\[ J^* = \max E\Big[ \sum_{k=1}^{\infty} \lambda(u_k(t))\, u_k(t)\, \bar{T}_k\, p_k \Big]. \]

The expected static revenue is 

\[ J_s = \max \sum_{k=1}^{\infty} \lambda(u_k)\, u_k\, \bar{T}_k\, p_k\, (1 - P_{\mathrm{loss}}), \]

where $P_{\mathrm{loss}}$ is the probability that a call is blocked given the static prices $u_k$. 
Note that this probability is independent of $k$. 

Let $\lambda_k = \lambda(u_k)$. Again we write $u_k$ as a function of $\lambda_k$. Also let $F(\lambda_k) = 
\lambda_k u(\lambda_k)$. We have: 

Proposition 4. If $F$ is concave, then $J^*$ is upper bounded by $J_{ub}$, which is the 
solution of the following optimization problem: 

\[ J_{ub} = \max \sum_{k=1}^{\infty} F(\lambda_k)\, \bar{T}_k\, p_k \qquad (10) \]

\[ \text{subject to } \sum_{k=1}^{\infty} \lambda_k\, \bar{T}_k\, p_k \le R. \qquad (11) \]

Also, when the system is scaled by $c$, i.e., $\lambda^c(u) = c\lambda(u)$ and $R^c = cR$, we have 
$\lim_{c \to \infty} J_s^c/c = \lim_{c \to \infty} J^{*,c}/c = \lim_{c \to \infty} J_{ub}^c/c$. Hence, the static price of the 
form $u(T) = u_k$, $T \in [T_k, T_k + \Delta T_k)$, suffices. 

Proof. If we view each substream as a class, the only difference from the system 
we studied in the last section is that it has a countably infinite number of classes. Since 
we can still use Little’s Law for each substream, the first part of this proposition 
can be shown by using the same idea as in the proof of the upper bound in the last 
section and by invoking the Monotone Convergence Theorem. 

In order to show the second part, it suffices to show that when we use the 
price induced by the upper bound as our static price, $P_{\mathrm{loss}} \to 0$ as $c \to \infty$. Note 
that as far as $P_{\mathrm{loss}}$ is concerned, the system can be viewed as having only a single 
class. Therefore, we can reuse the result in the last section. Compared with a 
system with no pricing control, the arrivals in this system are “weighted” by the 
static price $u_k$ according to their holding times. Hence, the arrival process is still 
Poisson, with arrival rate $\lambda' = \sum_{k=1}^{\infty} \lambda(u_k)\, p_k$, and the holding times 
are i.i.d. with mean $T' = \big( \sum_{k=1}^{\infty} \bar{T}_k\, \lambda(u_k)\, p_k \big) / \lambda'$. Because the constraint (11) is 
satisfied by such an induced static price, we have $\lambda' T' \le R$. Using the same 
techniques as in the corresponding proof of the last section, we can show that $P_{\mathrm{loss}} \to 0$ as 
$c \to \infty$. Then, the result follows. □ 
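
The vanishing of the blocking probability under scaling can be illustrated numerically. The following is only a sketch and is not part of the original analysis: it assumes that the single-resource system with unit-bandwidth flows behaves as an Erlang loss system, so that the blocking probability is given by the Erlang B formula (which depends on the holding time distribution only through its mean). The function names and the numerical values of load and capacity are hypothetical.

```python
import math


def erlang_b(offered_load: float, servers: int) -> float:
    """Blocking probability of an Erlang loss system with `servers` unit-bandwidth
    slots and offered load `offered_load` (arrival rate times mean holding time)."""
    blocking = 1.0  # with zero servers every call is blocked
    for n in range(1, servers + 1):
        # Standard, numerically stable Erlang B recursion.
        blocking = offered_load * blocking / (n + offered_load * blocking)
    return blocking


def scaled_blocking(load: float, capacity: int, scales):
    """Blocking when both the offered load and the capacity are multiplied by c."""
    return [erlang_b(c * load, c * capacity) for c in scales]


if __name__ == "__main__":
    scales = [1, 4, 16, 64]
    # Hypothetical numbers: capacity R = 10, with offered load 8 (constraint slack)
    # and offered load 10 (constraint tight, i.e. lambda' * T' = R).
    slack = scaled_blocking(8.0, 10, scales)
    tight = scaled_blocking(10.0, 10, scales)
    for c, p_slack, p_tight in zip(scales, slack, tight):
        print(f"c={c:3d}  P_loss(slack)={p_slack:.2e}  P_loss(tight)={p_tight:.4f}  "
              f"sqrt(c*R)*P_loss(tight)={math.sqrt(10 * c) * p_tight:.3f}")
```

In the tight case the product of the blocking probability and the square root of the capacity stays bounded as $c$ grows, which is in line with a decay of order $1/\sqrt{c}$; in the slack case the blocking probability falls off much faster.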

Next we study the form of the optimal price $u_k$ which maximizes the upper 
bound (10). We construct another bound $J'$ as the solution of the following 
optimization problem: 

\[ J' = \max F(\lambda) \sum_{k=1}^{\infty} \bar{T}_k\, p_k \qquad (12) \]

\[ \text{subject to } \lambda \sum_{k=1}^{\infty} \bar{T}_k\, p_k \le R. \]

If $J' = J_{ub}$, we will be able to conclude that the optimal $\lambda_k$ in (10) should be 
independent of $k$, i.e., it is independent of the duration $T$! Therefore the price 
$u(T)$ should also be independent of $T$. We now show that this is indeed the case. 

From the definition of $J_{ub}$, $J' \le J_{ub}$. It remains to show that $J' \ge J_{ub}$. For 
any set of $\lambda_k$ that satisfies the constraint of (10), let 

\[ \lambda = \frac{\sum_{k=1}^{\infty} \lambda_k\, \bar{T}_k\, p_k}{\sum_{k=1}^{\infty} \bar{T}_k\, p_k}. \]

Then $\lambda$ will satisfy the constraint of (12). Since $F$ is concave, we have 

\[ F(\lambda) \ge \frac{\sum_{k=1}^{\infty} F(\lambda_k)\, \bar{T}_k\, p_k}{\sum_{k=1}^{\infty} \bar{T}_k\, p_k}. \]

Therefore $J' \ge J_{ub}$. 

Now $\lambda_k$ is independent of $k$, and hence, so is $u_k$. We conclude that, in the 
asymptotic regime, not only is there no incentive to price according to 
instantaneous utilization, but there is also no incentive to price based on the duration 
of the individual calls! Again, we recall the previous result, that the speed of 
convergence for this invariance result to hold is at least $1/\sqrt{c}$. 
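
The concavity step can be checked numerically. The sketch below rests on assumptions that are not in the original: the revenue function is taken to be $F(\lambda) = \lambda(1 - \lambda)$ (a hypothetical linear demand curve), the number of duration slots is truncated to a finite value, and all function names are illustrative. It replaces a set of per-slot rates by their duration-weighted average and compares the two objectives.

```python
import random


def revenue_rate(lam: float) -> float:
    """Hypothetical concave revenue F(lam) = lam * u(lam), with u(lam) = 1 - lam."""
    return lam * (1.0 - lam)


def averaged_rate(lams, mean_durations, probs):
    """Duration-weighted average of the per-slot rates lam_k."""
    weights = [t * p for t, p in zip(mean_durations, probs)]
    return sum(l * w for l, w in zip(lams, weights)) / sum(weights)


def per_slot_objective(lams, mean_durations, probs):
    """Objective with slot-dependent rates: sum_k F(lam_k) * T_k * p_k."""
    return sum(revenue_rate(l) * t * p
               for l, t, p in zip(lams, mean_durations, probs))


def uniform_objective(lam, mean_durations, probs):
    """Objective with a single rate: F(lam) * sum_k T_k * p_k."""
    return revenue_rate(lam) * sum(t * p for t, p in zip(mean_durations, probs))


if __name__ == "__main__":
    rng = random.Random(0)
    slots = 5
    mean_durations = [rng.uniform(0.5, 5.0) for _ in range(slots)]
    raw = [rng.random() for _ in range(slots)]
    probs = [r / sum(raw) for r in raw]
    lams = [rng.random() for _ in range(slots)]
    lam = averaged_rate(lams, mean_durations, probs)
    print("slot-dependent rates:", per_slot_objective(lams, mean_durations, probs))
    print("single averaged rate:", uniform_objective(lam, mean_durations, probs))
```

The averaged rate carries the same total load as the original rates, so it satisfies the same capacity constraint, and for a concave revenue function the single-rate objective is never smaller than the slot-dependent one.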

Our above result suggests that, in a very broad sense, the optimal price needs 
only to take care of one parameter, that is, the traffic class (the total 
cost of a call also depends on its duration, but the relationship is linear). Note that the above 
result is obtained under the assumption that the arrival elasticity is independent 
of $T$. If this is not true, the performance of our static price in the 
form of $u(T)$ will still approach the performance of the optimal dynamic scheme, 
but $u(T)$ will not be a constant. In this case, we may need to treat calls of different 
duration as different classes. 



4 Conclusion and Future Work 

In this work we study pricing as a mechanism to control large networks. Our 
model is based on revenue maximization in a general loss network with Poisson 
arrivals and arbitrary holding time distributions. In dynamic pricing schemes, 
the network provider can charge different prices to the user according to the 
varying level of utilization in the network. We prove that this type of a sys- 
tem with feedback is asymptotically stationary and ergodic. We then analyze 
the performance of the static pricing scheme compared with that of the optimal 
dynamic pricing scheme. We prove that there is an upper bound on the perfor- 
mance of dynamic pricing schemes, and that the performance of an appropriately 
chosen static pricing scheme will approach that of the optimal dynamic pricing 
scheme, when the system is large. Under certain conditions, this static price will 
only depend on the price-elasticity of each class, while being independent of the 
route of the call. 

We then extend the result to the case when a network provider can charge 
a different price according to an additional factor, i.e., the individual holding 
time. We develop appropriate static pricing schemes for this case and show that 
the static scheme that is independent of the individual holding time performs as 
well as the optimal dynamic schemes, when the system is large. 

The above results have important implications for the networks of today and 
of the future. Compared with dynamic pricing schemes, static pricing schemes 
have some desirable properties. They are less computationally intensive, and 
consume less network bandwidth. Their performance will not degrade as the 
network delay grows. Our results show that when the system is large, as in 
broadband networks, the difference between static pricing schemes and dynamic 
pricing schemes is minimal. We are currently investigating how to use these 
results to develop efficient and realistic algorithms to control large networks. 

We also note that when the holding time distribution is heavy tailed, the 
sample path convergence to the mean becomes slow. We think this is caused 
by the fact that with the Pareto distribution, some flows will have very large 
duration. This fact has two implications on the sample path convergence. One 
is that the average of the Pareto distributed random variables converges very 
slowly to their mean, because of a small number of very large samples. The other 
is that the queue dynamics tend to have correlation over a very large timescale, 
which leads to very slow convergence of statistics based on queue average. 

This indicates that the long term average may not be practically meaningful 
in such cases. The transient behavior might be more important, and deserves to 
be handled more carefully. 
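
The first of the two effects can be reproduced with a few lines of simulation. The sketch below is illustrative only: the tail index 1.5 (finite mean, infinite variance), the sample sizes, the seed and the function names are assumptions that do not come from the experiments reported in this paper. It compares the running sample mean of Pareto holding times with that of exponential holding times of the same mean.

```python
import random


def pareto_mean(shape: float) -> float:
    """Mean of a Pareto variable with minimum value 1 and tail index shape > 1."""
    return shape / (shape - 1.0)


def running_means(samples):
    """Sample mean after each new observation."""
    means, total = [], 0.0
    for count, value in enumerate(samples, start=1):
        total += value
        means.append(total / count)
    return means


def relative_errors(shape: float, n: int, seed: int):
    """Relative error of the running mean for Pareto holding times and for
    exponential holding times with the same mean."""
    rng = random.Random(seed)
    mean = pareto_mean(shape)
    pareto = running_means(rng.paretovariate(shape) for _ in range(n))
    expo = running_means(rng.expovariate(1.0 / mean) for _ in range(n))
    return ([abs(m - mean) / mean for m in pareto],
            [abs(m - mean) / mean for m in expo])


if __name__ == "__main__":
    # Tail index 1.5: the mean is 3.0 but the variance is infinite.
    pareto_err, expo_err = relative_errors(shape=1.5, n=100_000, seed=1)
    for n in (100, 1_000, 10_000, 100_000):
        print(f"n={n:6d}  Pareto error={pareto_err[n - 1]:.3f}  "
              f"exponential error={expo_err[n - 1]:.3f}")
```

With an infinite-variance tail, the Pareto running mean typically shows occasional large jumps caused by single very long durations, whereas the exponential running mean settles steadily.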



5 Appendix 



Proof (of LemmaU}^. We follow p. To show P{Aq} > 0 , we note that for some 
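$a > 0$, $m \ge 1$, 

\[ P\{A_0\} \ge P\Big\{ \{\tau_0 > a\} \cap \bigcap_{k=1}^{m} \{T_{-k} < a\} \cap \bigcap_{k=m+1}^{\infty} \Big\{ T_{-k} < \sum_{j=-k+1}^{-1} \tau_j \Big\} \Big\} \]

\[ = P\{\tau_0 > a\} \prod_{k=1}^{m} P\{T_{-k} < a\}\; P\Big\{ \bigcap_{k=m+1}^{\infty} \Big\{ T_{-k} < \sum_{j=-k+1}^{-1} \tau_j \Big\} \Big\}. \]
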

This can be interpreted as follows: $A_0$ is the event $\{Q_0 = 0\}$, i.e., 
all potential arrivals before time 0 have left the system. The event on the right 
of the inequality above says that, of all the potential arrivals before time 0, the 
last arrival arrives more than a time interval $a$ earlier ($\tau_0 > a$); the last $m$ arrivals all 
have service time less than $a$; and finally, the rest of the arrivals leave the system 
before time $-1$. Obviously this is a smaller event than $A_0$. From now on we will 
focus on this event only. 

Now choose $a$ such that $P\{T_{-1} < a\} = q > 0$; we also have 
$P\{\tau_0 > a\} = p > 0$, since the interarrival times are exponential. 

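Then 

\[ P\{A_0\} \ge p\, q^m\, P\Big\{ \bigcap_{k=m+1}^{\infty} \Big\{ T_{-k} < \sum_{j=-k+1}^{-1} \tau_j \Big\} \Big\}. \]

Now let $B$ be the event inside the bracket on the right hand side. We only 
need to show $P\{B\} > 0$ for some $m$. 

Choose $b < E\{\tau_0\}$. Then 

\[ P\{B^c\} = P\Big\{ \bigcup_{k=m+1}^{\infty} \Big\{ T_{-k} \ge \sum_{j=-k+1}^{-1} \tau_j \Big\} \Big\} \le P\Big\{ \bigcup_{k=m+1}^{\infty} \Big\{ \sum_{j=-k+1}^{-1} \tau_j < b(k-1) \Big\} \Big\} + P\Big\{ \bigcup_{k=m+1}^{\infty} \{ T_{-k} \ge b(k-1) \} \Big\}. \]

Now as $m \to \infty$, the first term goes to 

\[ P\Big\{ \bigcap_{m=1}^{\infty} \bigcup_{k=m+1}^{\infty} \Big\{ \sum_{j=-k+1}^{-1} \tau_j < b(k-1) \Big\} \Big\} = P\Big\{ \sum_{j=-k+1}^{-1} \tau_j < b(k-1) \text{ i.o.} \Big\} = 0 \]

by the Strong Law of Large Numbers (since $b < E\{\tau_0\}$). 

On the other hand, as $m \to \infty$, the second term satisfies 

\[ P\Big\{ \bigcup_{k=m+1}^{\infty} \{ T_{-k} \ge b(k-1) \} \Big\} \le \sum_{k=m+1}^{\infty} P\{ T_{-k} \ge b(k-1) \} \to 0, \]

since $E\{T_{-k}\} < \infty$. 

Therefore we can choose $m$ large enough such that $P\{B^c\} < 1/2$. Then 
$P\{A_0\} \ge p\, q^m\, P\{B\} \ge p\, q^m / 2 > 0$. 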



Proof (of Lemma\^. First note that 

\[ P\{X_n \in A_n \text{ at least once}\} = P\Big\{ \bigcup_{n=0}^{\infty} A_n \Big\}. \]

Again let $T$ be the shift operator. Let $B = \bigcup_{n=0}^{\infty} A_n$; then $TB \subset B$ and $P\{TB\} = 
P\{B\}$, because $B$ is also a stationary event. Therefore $TB$ and $B$ differ by a set 
of measure zero, so $B$ is an invariant set. Since the arrivals are ergodic, $P\{B\}$ is 
0 or 1. However, since $P\{B\} \ge P\{A_0\} > 0$, we have $P\{B\} = 1$, i.e., 
$P\{X_n \in A_n \text{ at least once}\} = 1$. 

By 0, Prop 6.38. 

\[ E\{T_1 \mid X_0 \in A_0\} = \frac{1}{P\{A_0\}} < \infty. \]

□ 

Abstract. The Assured Forwarding Per Hop Behavior (AF PHB) has been 
devised by the IETF Differentiated Services (DiffServ) working group to 
provide drop level differentiation. The intent of AF is to support services with 
different loss requirements, but with no strict delay and jitter guarantees. 
Another suggested use of AF is to provide differentiated support for traffic 
conforming to an edge conditioning/policing scheme with respect to 
non-conforming traffic. 

The scope of this paper is twofold. First, we show that, quite surprisingly, a 
standard AF PHB class is semantically capable of supporting per flow 
admission control. This is obtained by adopting the AF PHB as the core routers' 
forwarding mechanism in conjunction with an End Point Admission Control 
mechanism running at the network edge nodes. The performance achieved by 
our proposed approach depends on the specific AF PHB implementation running 
in the core routers. 

In the second part of the paper, we prove that changes in the customary AF 
PHB implementations are indeed required to achieve strict QoS performance. 
To prove this point, we have evaluated the performance of our admission 
control scheme over a simple AF implementation based on RED queues. Our 
results show that, regardless of the selected RED thresholds configuration, such 
an implementation is never capable of guaranteeing tight QoS support, but is 
limited to providing better-than-best-effort performance. 



1 Introduction 

Two QoS architectures are being discussed in the Internet arena: Integrated Services 
and Differentiated Services. Nevertheless, quoting the recent RFC [R2990], "both the 
Integrated Services architecture and the Differentiated Services architecture have 
some critical elements in terms of their current definition, which appear to be acting 
as deterrents to widespread deployment... There appears to be no single 
comprehensive service environment that possesses both service accuracy and scaling 
properties". In fact: 

1. the IntServ/RSVP paradigm [R2205, R2210] is devised to establish reservations 
at each router along a new connection path, and provide "hard" QoS guarantees. 
In this sense, it is far from being a novel reservation paradigm, as it inherits its basic 
ideas from ATM and the complexity of the traffic control scheme is comparable. 
In the heart of large-scale networks, the cost of RSVP soft state maintenance and 
of processing and signaling overhead in the routers is significant and thus there 
are scalability problems. In addition to complexity, we feel that the lack of full 
acceptance of the IntServ approach in the Internet market is also 
related to the fact that RSVP needs to be deployed in all the involved routers to 
provide end-to-end QoS guarantees; hence this approach is not easily and 
smoothly compatible with existing infrastructures. What we are trying to say is 
that complexity and scalability are really important issues, but that backward 
compatibility and smooth Internet upgrade in a multi-vendor Internet market 
scenario are probably even more important. 

2. Following this line of reasoning, we argue that the success of the DiffServ 
framework [R2474, R2475] does not lie solely in the fact that it is an 
approach devised to overcome the scalability limits of IntServ. As in the legacy 
Internet, the DiffServ network is oblivious of individual flows. Each router 
merely implements a suite of scheduling and buffering mechanisms, to provide 
different aggregate service assurances to different traffic classes whose packets 
are accordingly marked with a different value of the Differentiated Services Code 
Point (DSCP) field in the IP packet header. By leaving untouched the basic 
Internet principles, DiffServ provides supplementary tools to further move the 
problem of Internet traffic control up to the definition of suitable pricing/service 
level agreements (SLAs) between peers. However, DiffServ lacks a standardized 
admission control scheme, and does not intrinsically solve the problem of 
controlling congestion in the Internet. Upon overload in a given service class, all 
flows in that class suffer a potentially harsh degradation of service. RFC [R2998] 
recognizes this problem and points out that "further refinement of the QoS 
architecture is required to integrate DiffServ network services into an end-to-end 
service delivery model with the associated task of resource reservation" . It is thus 
suggested [R2990] to define an "admission control function which can determine 
whether to admit a service differentiated flow along the nominated network 
path". 

The scope of this paper is to show that such an admission control function can be 
defined on top of a standard DiffServ framework, by simply making smart use of 
the semantics underlying the Assured Forwarding Per Hop Behavior (AF PHB 
[R2597]). Obviously, this function must not imply the management of per flow 
states, which are alien to DiffServ and which would re-introduce scalability problems. 
The scope of our admission control function is Internet-wide (i.e., not limited to a 
single domain). It is deployed by pure endpoint operation: edge nodes involved in a 
communication are in charge of taking an explicit decision whether to admit a new 
flow or reject it. These edge nodes rely upon the successful delivery of probe packets, 
i.e., packets tagged with a suitable DSCP label, independently generated by the end- 
points at flow setup. The internal differentiated management of probes and packets 
originated by already admitted flows is performed in conformance with the AF PHB 
definition. Also, following the spirit of DiffServ, the degree of QoS provided is 
delegated to each individual DiffServ domain, and depends on the AF PHB specific 
implementation and tuning done at each core router of the domain. 

For convenience, we will use the term GRIP (Gauge&Gate Reservation with 
Independent Probing) to name the overall described operation. GRIP was originally 
proposed in [BB01a], although its mapping over a standard DiffServ framework was 
not recognized until [BB01b]. GRIP combines the packet differentiation capabilities 
of the AF PHB with both the distributed and scalable logic of Endpoint Admission 
Control [BRE00], and the performance advantages of measurement based admission 
control schemes [BJS00]. However, we remark that GRIP is not a new reservation 
protocol for the Internet (in this, differing from the SRP protocol [ALM98], from 
which GRIP inherits some strategic ideas). Instead, GRIP is a novel reservation 
paradigm that allows independent end point software developers and core router 
producers to inter-operate within the DiffServ framework, without explicit protocol 
agreements. 

The organization of this paper is the following. Section 2 provides an 
understanding of the rationale and limits of Endpoint Admission Control. Section 3 
describes the GRIP operation and its support over AF PHB classes. Section 4 first 
qualitatively discusses the performance achievable by specifically designed 
AF implementations; it then presents numerical results showing that GRIP's 
performance over AF PHB routers implemented with RED queues [FVJ93] is 
limited to better-than-best-effort support. Finally, conclusions are drawn in Section 5. 



2 Unfolding Endpoint Admission Control 

Endpoint Admission Control (EAC) is a recent research trend in QoS provisioning 
over IP [BRE00]. EAC builds upon the idea that admission control can be managed 
by pure end-to-end operation, involving only the source and destination host. At 
connection set-up, each sender-receiver pair starts a Probing phase whose goal is to 
determine whether the considered connection can be admitted to the network. In some 
EAC proposals [BOR99, ELE00, BRE00], during the Probing phase, the source node 
sends packets that reproduce the characteristics (or a subset of them) of the traffic that 
the source wants to emit through the network. Upon reception of the first probing 
packet, the destination host starts monitoring probing packet statistics (e.g., loss 
ratio, probe interarrival times) for a given period of time. At the end of the 
measurement period and on the basis of suitable criteria, the receiver takes the 
decision whether to admit or reject the connection and notifies this decision to 
the source node. 
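
As a minimal sketch of the receiver-side decision just described, the following assumes that the only monitored statistic is the probe loss ratio and that the acceptance criterion is a fixed loss threshold. The class name, the field names and the threshold are hypothetical; the proposals cited above use their own, generally more elaborate, criteria.

```python
from dataclasses import dataclass


@dataclass
class ProbeStats:
    """Counters kept by the destination during the measurement period."""
    expected: int = 0   # probes the source is assumed to have sent
    received: int = 0   # probes that actually arrived

    def loss_ratio(self) -> float:
        if self.expected == 0:
            return 1.0  # nothing to measure: treat as total loss
        return 1.0 - self.received / self.expected


def admit(stats: ProbeStats, max_loss_ratio: float) -> bool:
    """Accept the connection if the measured probe loss is within the threshold."""
    return stats.loss_ratio() <= max_loss_ratio


if __name__ == "__main__":
    print(admit(ProbeStats(expected=100, received=99), max_loss_ratio=0.02))
    print(admit(ProbeStats(expected=100, received=90), max_loss_ratio=0.02))
```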

Although the described scheme looks elegant and promising (it is scalable, it 
does not involve inner routers), a number of subtle issues arise when we look at 
QoS performance. A scheme purely based on endpoint measurements suffers from 
performance drawbacks mostly related to the necessarily limited (a few hundred ms, 
for reasonably bounded call setup times) measurement time spent at the destination. 
Measurements taken over such a short time cannot capture stationary network states, 
and thus the decision whether to admit or reject a call is taken over a snapshot of the 
network status, which can be quite an unrealistic picture of the network congestion 
level. 

The simplest solution to the above issue (other solutions are being explored, but 
their complete discussion is well beyond the aims of the present 
paper) is to attempt to convey more reliable network state information to the edge of 
the network. Several solutions have been proposed in the literature. [CKN00] 
proposes to drive EAC decisions from measurements performed on a longer time 
scale among each ingress/egress pair of nodes within a domain. [GKE99, SZH99, 
KEL00] use packet marking to convey explicit congestion information to the relevant 
network nodes in charge of taking admission control decisions. [MOR00] performs 
admission control at layers above IP (i.e., TCP), by requiring each core router to parse 
and capture TCP SYN and SYN/ACK segments, and forward such packets only if 
local congestion conditions allow admission of a new TCP flow. 

To summarize the above discussion, and to proceed further, we can state that an 
EAC is, ultimately, the combination of three logically distinct components (in 
some specific solutions, e.g. [BOR99, ELE00], these components are not 
clearly separated, but all three are nonetheless 
simultaneously present): 

1. edge nodes in charge of taking explicit per flow accept/reject decisions; 

2. physical principles and measures on which decisions are based (e.g., congestion 
status of an internal link or an ingress/egress path, and particular measurement 
technique - if any - adopted to detect such status); 

3. the specific mechanisms adopted to convey internal network information to edge 
nodes (e.g., received probing bandwidth measurement, IP packet marking, 
exploitation of layers above IP with a well-defined notion of connection or even 
explicit signaling). 

In such a view, EAC can be re-interpreted as a Measurement Based Admission 
Control (MBAC) that runs internally to the network (i.e., in a whole domain or, much 
simpler, in each internal router). This MBAC scheme locally determines, according to 
some specific criteria (which can be as simple as not performing any measurement at all, 
and taking a snapshot of the link state, or as complex as some of the techniques 
proposed in [BJS00, GRO99]), whether a new call can be locally admitted (i.e. as far 
as the local router is concerned). This set of information (one item per distinct 
network router) is implicitly (or explicitly) collected and aggregated at the edge nodes 
of the network; these nodes are ultimately in charge of taking the accept/reject decision. 

Put in these terms, EAC was sketched as early as in the SRP protocol 
specification [ALM98]. Unfortunately (see, e.g., the remarks in [BRE00]), SRP 
appeared much more like a lightweight signaling protocol, with explicit reservation 
messages, than an EAC technique with increased intelligence within the core 
routers. Moreover, SRP requires network routers to actively manage packets (via 
remarking of signaling packets when congestion occurs), and thus it does not fit 
within a DiffServ framework, where the core routers' duty is strictly limited to 
forwarding packets at the greatest possible speed. 

Of the three components outlined above, we argue that, in a DiffServ framework, 
the least critical issue is how to estimate the congestion status of a router without 
resorting to per flow operation. In fact, recent literature [GRO99, BJS00] has shown 
that aggregate load measurements are extremely robust and efficient. These schemes 
do not exploit per-flow state information and related traffic specifications. Instead, 
they operate on the basis of per-node aggregate traffic measurements carried out at the 
packet level. The robustness of these schemes lies in the fact that, under suitable 
conditions (e.g. flow peak rates small with respect to link capacities), they are barely 
sensitive to uncertainties on traffic profile parameters. As a consequence, it seems that 
scalable estimations can be independently carried out by the routers. 

The real admission control problem is how to convey the status of core routers 
(evaluated by means of aggregate measurements) to the end points so that the latter 
devices can take informed admission control decisions, without violating the DiffServ 
paradigm. For obvious reasons, we cannot use explicit per flow signaling. Similarly, 
we do not want to modify the basic router operation, by introducing packet marking 
schemes or forcing routers to parse and interpret higher layer information. What we 
want to do is to implicitly convey the status of core routers to the end points, by 
means of scalable, DiffServ compliant procedures. 



3 GRIP: Endpoint Admission Control over AF Per Hop Behavior 

We name GRIP (Gauge&Gate Reservation with Independent Probing) a reservation 
framework where Endpoint Admission Control decisions are driven by probing 
packet losses occurring in the internal network routers. The Gauge&Gate acronym 
stems from the assumption that network routers are able to drive probing packet 
discarding (Gate) on the basis of accepted traffic measurements (Gauge), and thus 
implicitly convey congestion status information to the edge of the network by means 
of reception/lack of reception of probes independently generated by edge nodes. 
GRIP concepts were originally introduced in [BB01a], but their mapping over DiffServ 
was not fully recognized until [BB01b] (as a matter of fact, in [BB01a] we 
erroneously required the introduction of a new PHB to implement our scheme). The 
present paper moves further, and shows that the generic GRIP router operation is 
indeed intrinsically accounted for in the Assured Forwarding (AF) PHB. In other words, 
the AF PHB class, with no further modifications to the specifications described in 
[R2597], is semantically capable of seamlessly supporting EAC. 

AF PHBs have been devised to provide different levels of forwarding assurances 
within the Differentiated Services Framework [R2474, R2475]. Four AF PHB classes 
have been standardized, each composed of three drop levels. In what follows, we will 
use the notation AFxj to indicate packet marks belonging to the AF class x, with drop 
level j. Conforming to [R2597], within a class x, if i<j, the dropping probability of 
packets labeled AFxi is lower than that of packets labeled AFxj. 
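
For concreteness, the AFxj notation can be mapped onto DSCP values. The sketch below uses the recommended codepoints of [R2597], in which the class occupies the three most significant bits of the six-bit DSCP and the drop level the next two; the function name is an illustrative choice, not part of any specification.

```python
def af_dscp(af_class: int, drop_level: int) -> int:
    """Recommended DSCP value for AF class x (1..4) and drop level j (1..3)."""
    if not 1 <= af_class <= 4:
        raise ValueError("AF class must be between 1 and 4")
    if not 1 <= drop_level <= 3:
        raise ValueError("drop level must be between 1 and 3")
    # Class in the top three bits, drop level in the next two, last bit zero.
    return 8 * af_class + 2 * drop_level


if __name__ == "__main__":
    for x in range(1, 5):
        row = [f"AF{x}{j}={af_dscp(x, j):06b}" for j in range(1, 4)]
        print("  ".join(row))
```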

The example services presented in the appendix of [R2597] show that the 
primary intent of AF is to promote performance differentiation (in terms of packet 
drop), either among different traffic classes, e.g., marked with different drop levels, as 
well as within the same traffic class, e.g., marking traffic conforming to a policy 
specification with a lower drop level than non conforming traffic. However, low loss 
and low latency traffic support appears to be out of the targets of the AF model. To a 
larger extent, as discussed in the introduction, QoS guarantees appear not only 
unfeasible over AF, but also out of reach of the basic DiffServ architectural model, 
due to the lack of an explicit resource reservation mechanism. 

Based on the discussion carried out in Section 2, it is now possible to argue 
that the AF PHB definition contains all the necessary semantics to support per flow 
admission control. Quoting [R2597], “an AF implementation MUST detect and 
respond to long-term congestion within each class by dropping packets, while 
handling short term congestion (packet bursts) by queueing packets. This implies the 
presence of a smoothing or filtering function that monitors the instantaneous 
congestion level and computes a smoothed congestion level. The dropping algorithm 
uses this smoothed congestion level to determine when packets should be discarded”. 

This sentence explicitly states that an AF router is capable of supporting a 
possibly sophisticated measurement criterion that can drive packet discarding. To 
run EAC over an AF PHB class it is simply necessary to clarify issue (3) presented in 
Section 2 (i.e., the specific mechanism adopted to convey internal network 
information to edge nodes). This is done by assigning to a specific AF dropping level 
the task of notifying internal network congestion to the end nodes by means of packet 
dropping (which, for an AF-compliant router, is the only capability we can rely on). 



3.1 AF Router Operation 

A particular implementation of the DiffServ router output port operation supporting 
the AF PHB is depicted in Fig. 1 . Packets routed to the relevant output are classified 
on the basis of their DSCP tag and dispatched to the relevant PHB handler. 




Fig. 1. Router output port operation 

Let us now focus our attention to a specific module in charge of handling AF 
traffic belonging to a given class x. 

A measurement module is devised to measure, at run time, the aggregate AF class x 
traffic (or AFx traffic). The measurement module depicted in the figure does not 
interact with the forwarding of AFx1 packets, i.e., these packets are forwarded to the 
FIFO buffer placed at the output regardless of the measurements taken. On the basis 
of such measurements, this module triggers a suitable dropping algorithm on the 
AFx2 traffic. With respect to the general AF PHB operation, our AFx2 dropping 
algorithm depends on AFx traffic measurements. Note also that, for simplicity of 
presentation, the drop level AFx3 is neglected until Section 3.3. 

The simplest dropping algorithm is represented by a “gate” (smoother dropping 
algorithms for AFx2 packets, e.g. RED-like algorithms, may be considered to 
improve stability). When the measurement module does not detect congestion on the 
AFx traffic (the notion of congestion being implementation-dependent), it keeps the 
gate open (we call this the “ACCEPT” state). When the gate is open no AFx2 packet is 
dropped. Conversely, the measurement module keeps the gate closed (“REJECT” 
state) when congestion is detected, i.e., it enforces a 100% drop probability over 
AFx2 packets. Note that this operation does not violate the AF drop level relationship, 
as the AFx1 dropping probability is lower than the AFx2 one. 
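
The gate behaviour can be sketched in a few lines. The notion of congestion is left implementation-dependent in the text, so the example below makes its own assumptions: the measurement module keeps an exponentially weighted moving average of the measured AFx load and declares congestion when it exceeds a fixed threshold. The class name, the parameters and the threshold rule are hypothetical, and buffer overflow is ignored.

```python
class GateAFClass:
    """Sketch of one AF class handler with a gate on AFx2 packets."""

    def __init__(self, threshold: float, smoothing: float = 0.1):
        self.threshold = threshold    # assumed congestion threshold on the smoothed load
        self.smoothing = smoothing    # assumed weight of the newest measurement
        self.smoothed_load = 0.0
        self.fifo = []                # output FIFO shared by AFx1 and AFx2 packets

    def gauge(self, measured_load: float) -> None:
        """Update the smoothed aggregate AFx load with a new measurement."""
        self.smoothed_load += self.smoothing * (measured_load - self.smoothed_load)

    @property
    def state(self) -> str:
        return "REJECT" if self.smoothed_load > self.threshold else "ACCEPT"

    def enqueue(self, packet_id, drop_level: int) -> bool:
        """Return True if the packet is queued, False if it is dropped."""
        if drop_level == 2 and self.state == "REJECT":
            return False              # gate closed: every AFx2 packet is dropped
        self.fifo.append(packet_id)   # AFx1 is queued regardless of the measurements
        return True


if __name__ == "__main__":
    handler = GateAFClass(threshold=0.8, smoothing=0.5)
    for load in (0.4, 0.9, 1.0, 1.0):
        handler.gauge(load)
        print(load, handler.state, handler.enqueue("probe", drop_level=2))
```

Replacing the hard threshold in `enqueue` with a drop probability that grows with the smoothed load would give one of the smoother, RED-like variants mentioned above.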

While the above description is simply a particular implementation of an AF class, 
we now show its interpretation in terms of implicit signaling, which has important 
consequences for the definition of our overlay admission control function. In fact, let 
us assume that: i) the considered AF class x is devoted to the support of QoS aware 
flows, requiring an admission control procedure; ii) traffic labeled AFx1 is generated 
by flows which have already passed an admission control test; iii) AFx2 packets are 
“signaling” packets injected in the network by flows during the setup phase (in 
principle, one AFx2 packet per flow). 

According to the described operation, an AFx2 packet is delivered to its 
destination ONLY IF it encounters all the routers along the path in the ACCEPT state. 
This operation provides an implicit binary signaling pipe, semantically equivalent to a 
one-bit explicit congestion notification scheme, without requiring explicit packet 
marking, or, worse, explicit signaling messages, contrary to the DiffServ spirit. 

The described router output port operation, combined with an endpoint admission 
control logic, allows overlaying an implicit signaling pipe over a signaling-unaware 
DiffServ framework. In fact, when an AFx2 packet reaches the destination, it 
implicitly conveys the information that all routers encountered across the path have 
locally declared themselves to be in the ACCEPT state, i.e., capable of admitting new 
connections (see next Section). 
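
The implicit binary signaling pipe amounts to a logical AND over the routers on the path. The sketch below is illustrative only: the function names are hypothetical, and the list of router states is assumed to be known to the simulation, whereas real end points observe only whether the probe arrives.

```python
def probe_delivered(router_states) -> bool:
    """An AFx2 probe crosses the path only if every router is in the ACCEPT state."""
    return all(state == "ACCEPT" for state in router_states)


def first_rejecting_hop(router_states):
    """Index of the router that drops the probe, or None if the probe is delivered.
    The end points never learn this index; it is returned only for illustration."""
    for hop, state in enumerate(router_states):
        if state != "ACCEPT":
            return hop
    return None


if __name__ == "__main__":
    path = ["ACCEPT", "ACCEPT", "REJECT", "ACCEPT"]
    print(probe_delivered(path), first_rejecting_hop(path))
```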

Finally, with reference to Fig. 1, the AFx PHB class handler stores packets in a 
FIFO buffer, to ensure that packets are forwarded in the order of their receipt, as 
required by the AF PHB specification [R2597]. Packet transmission over the output 
link is finally managed by a scheduler, which has the task of merging the traffic 
coming from the different PHB handlers implemented within the router output port. 



3.2 End Point Operation 

For clarity of presentation, in what follows, we identify the source and destination 
user terminals as the network end nodes¹. Consider a scenario where an application 
running on a source node within a DS domain wants to set up a one way (e.g., UDP) 
flow with a destination node, generally in a different DS domain. As shown in Fig. 2, 



1 Although, logically, user terminals are the natural nodes where the endpoint admission 
control should operate, this is clearly not realistic, for the obvious reason that the user may 
bypass the admission control test and directly send AFx1 packets. Identity authentication and 
integrity protection are therefore needed in order to mitigate this potential for theft of resources 
[R2990]. Administrators are then expected to protect network resources by configuring secure 
policers at interfaces (e.g. access routers) with untrusted customers. 
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the source node, triggered by the application via proprietary signaling, starts a 
connection setup attempt by sending in principle just one single packet (more 
discussion about this at the end of this Section), labeled AFx2 through the network. In 
the same time, a probing phase timeout is started. 

The role of the Destination Node simply consists in monitoring the incoming IP 
packets and detecting those labeled AFx2. Upon reception of an AFx2 packet, the 
destination node performs a receiver capability negotiation function, possibly based 
on proprietary signaling, aimed at verifying whether the destination application is 
able and willing to accept the incoming flow. We stress that such receiver capability 
negotiation is recognized as an important functionality for QoS enabled applications 
[R2990], and it is an important by-product of our solution. If the destination node is 
willing to accept the call request, it simply replies to each incoming probe packet 
by transmitting a feedback packet. For highest probability of delivery, the 
feedback packet is marked AFx1 (i.e., as an information packet). 



Fig. 2. End point GRIP operation (diagram labels: Source, Destination) 

The decision whether to admit or reject the call request is driven by the possible 
reception of the feedback by the source. When a feedback packet is received in 
response, the flow being set up is promoted to the "accepted" state, and the source node 
can start transmitting information packets, labeled as AFx1. Conversely, if no 
feedback packet is received within the probing phase timeout, the source node 
implicitly determines that at least one router along the path has declared 
itself not capable of accommodating additional flows, and thus the source node can 
abort the flow setup attempt (or repeat the setup attempt according to some suitable 
backoff mechanism). 
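The end point logic described so far can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Router`, `probe_reaches_destination` and `setup_flow` are hypothetical, the probing phase timeout is modeled implicitly (a missing feedback is treated as a timeout expiry), and the AFx1 feedback packet is assumed to be always delivered.

```python
from dataclasses import dataclass

ACCEPT, REJECT = "ACCEPT", "REJECT"


@dataclass
class Router:
    """Router output port; its state is set locally by its own measurements."""
    state: str = ACCEPT


def probe_reaches_destination(path):
    """An AFx2 probe is forwarded only by routers in the ACCEPT state."""
    return all(router.state == ACCEPT for router in path)


def setup_flow(path, receiver_willing=True):
    """Return "accepted" if feedback arrives at the source, "rejected" otherwise.

    The probing phase timeout is modeled implicitly: a missing feedback
    packet is treated as a timeout expiry at the source (an assumption).
    """
    # Source sends one AFx2 probe and starts the probing phase timeout.
    if not probe_reaches_destination(path):
        return "rejected"   # probe dropped on the path: no feedback, timeout expires
    # Destination runs its receiver capability negotiation.
    if not receiver_willing:
        return "rejected"   # destination sends no feedback
    # Destination replies with an AFx1 feedback packet (assumed delivered).
    return "accepted"       # source may now send AFx1 information packets


if __name__ == "__main__":
    print(setup_flow([Router(), Router()]))
    print(setup_flow([Router(), Router(REJECT)]))
```

Note that the source cannot tell which router rejected the probe, nor distinguish a congested path from an unwilling receiver: both cases simply appear as a missing feedback.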

In EAC terms, packets labeled AFx2 have the meaning of probes, while AFx1 
packets are meant to support already accepted traffic. The role of the drop level AFx3 
is addressed in Section 3.3. We have adopted AFx1 as the label assigned to the 
feedback packet since the goal of the feedback is to report back to the source the 
information that the probe has been correctly received. When a bidirectional flow 
setup is desired, the feedback packet has the additional task of testing the reverse 
path, and consequently it will be transmitted with an AFx2 label. 

Note that the basic GRIP operation, i.e., a single probe packet and a single 
feedback packet, is compatible with the H.323 call setup scheme using UDP, which 
encapsulates an H.225.0v2 call setup PDU into a UDP packet. Our solution thus seems 
perfectly compatible with existing applications. In addition, this scheme leaves the 
service provider free to provide optional implementation details, including: 

- Addition of proprietary signaling information in the probing packet payload or 
in the feedback packet payload, to be parsed, respectively, at the destination node 
or at the source node. 

- Definition of more complex probing phase operation, e.g., by including 
reattempt procedures after a setup failure, multiple timers and probes during the 
probing phase, etc. 

As a last consideration, it is interesting to remark that this idea is very 
close to what TCP congestion control does, but it is used in the novel 
context of admission control: end points interpret failed receptions of probes as 
congestion in the network and reject the relevant admission requests. 



3.3 Possible Roles of the AFx3 Level 

In the above description, the AFx3 drop level appears in principle unnecessary. 
However, it may be convenient to use this level. A first possibility is that the AFx3 
level be used to mark non-conforming packets, which may still be delivered if 
network resources are available. Second, AFx3 packet marking can be enforced over 
flows that have not successfully passed the described admission control test. This 
allows deploying a service model where high QoS is provided to flows that pass the 
admission control test, while best effort delivery is provided to initially rejected 
flows. These latter flows may occasionally retry the setup procedure, by simply 
marking occasional packets as AFx2 (e.g., by adhering to a suitable backoff 
procedure), and may eventually receive the upgraded AFx1 marking when network 
resources become available (as signaled by the reception of an AFx1 
feedback). 

The usage of the AFx3 level as described above aims at increasing the link 
utilization. However, [R2597] requires the drop probability for AFx3 to be greater than 
or equal to that of AFx2. This implies that the link utilization is bounded by the 
possibly strict mechanism that triggers AFx2 packet dropping: when AFx2 packets 
receive a 100% dropping probability, all AFx3 packets must also be dropped to 
conform to the [R2597] specification. A more effective mechanism would consist in 
implementing a dropping algorithm for the AFx3 traffic not directly related to the 
AFx2 drop algorithm. However, this usage of the AFx3 level does not conform to the 
AF specification, since the AFx3 dropping probability may end up lower than that of 
AFx2. 

A more interesting possible usage of the AFx3 level consists in providing a 
second control (probing) channel, in addition to AFx2. According to this solution, 
AFx1 traffic measurements trigger a dropping algorithm on the AFx3 traffic too, with 
stricter dropping conditions than the AFx2 dropping algorithm (i.e., AFx3 packets are 
assumed to detect congestion, and notify it via packet drop, before AFx2 packets). 
This AFx3 probing class could request admission for flows with, e.g., higher peak rate 
and bandwidth requirements than flows supported via the AFx2 probing class (i.e., we 
are adding a second implicit signaling pipe). Also, since the AFx3 channel is more 
reactive to congestion conditions, its usage can be envisioned to provide lower access 
priority to network resources. This would improve fairness and prevent some kinds of 
sources from “stealing” a large part of network resources. 



3.4 Arguments for New PHB Definitions 

Although this paper leaves untouched the basic AF PHB semantics, we feel that our 
suggested usage of AF is different (and quite unexpected) from what was intended in RFC 
2597. The services that are expected to make use of admission control are RTP/UDP 
streams with delay and loss performance requirements, whose support is currently 
envisioned by means of the EF PHB. On the contrary, AF appears designed to provide 
better than best effort support for generic TCP/UDP traffic. Thus, our study raises the 
case for the transformation of the (single) EF PHB into a PHB class (i.e., by adding an 
associated, "paired", probing pipe with a different DSCP). An alternative is defining 
new "paired" PHBs. 

From a different perspective, paired PHBs can be envisioned to support more 
general control functions than admission control. For example, the TCP fast 
retransmission and recovery algorithm might take advantage of isolated data packets 
labeled as “control”, and thus expected to encounter loss if (controlled) congestion is 
encountered in the network. 



4 Performance Issues 

The described admission control semantics provides a reference framework compatible 
with "current" AF implementations. The aim of this section is to provide, in section 4.1, 
some qualitative insights about the performance achievable by the GRIP operation. 
Then, in section 4.2, we show that poor performance is obtained over the RED-like 
mechanisms customarily used to implement AF PHBs. The results reported in this 
section allow us to conclude that new implementations based on explicit traffic 
measurement modules appear necessary if tight QoS support is aimed at. 



4.1 Degree of QoS Support in GRIP 

Quantitative and tunable performance may be independently provided and specified 
by each administrative entity. Uniform implementation across a specific domain 
allows defining a quantitative view (e.g., a PDB [PDB01]) of the performance 
achievable within a considered DS domain. In this way, the refinements deemed 
necessary in [R2990] to provide service accuracy in the DiffServ architectural model 
could be considered as accomplished. 

In fact, the performance achievable by the described end point admission control 
operation depends on the notion of congestion as the triggering mechanism for AFx2 
packet discarding, which is left to each specific implementation. Each administrative 
entity may arbitrarily tune the optimal throughput-delay/loss operational point 
supported by its routers, by simply determining the aggregate AFx1 traffic target 
supported in each router. The mapping of AFx1 throughput onto loss/delay 
performance in turn depends on the link capacities and on the traffic flow 
characteristics offered on the AF class x. 

With this approach, it is possible to construct PDBs offering quantitative 
guarantees. A building block of such PDBs is the definition of specific measurement 
modules and AFx2 dropping algorithms. A generic dropping algorithm is based on 
suitable rules (or decision criteria). An example of a trivial decision criterion is to 
accept all AFx2 packets when the measured AFx1 throughput is lower than a given 
threshold and reject all AFx2 packets when the AFx1 measurements exceed this 
threshold. The resulting performance depends upon the link capacity and the traffic 
model. 
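The trivial decision criterion above can be sketched as a small gate object. This is only an illustration under stated assumptions: the class name `ThresholdGate` and its methods are hypothetical, the way the AFx1 throughput is measured is left outside the sketch, and the behavior when the measurement exactly equals the threshold (here: reject) is a choice the text does not specify.

```python
class ThresholdGate:
    """Accept or drop AFx2 probes based on the measured AFx1 throughput."""

    def __init__(self, threshold_bps):
        self.threshold_bps = threshold_bps   # target aggregate AFx1 throughput
        self.measured_bps = 0.0              # latest AFx1 measurement

    def update_measurement(self, afx1_bps):
        """Store the latest aggregate AFx1 throughput measurement."""
        self.measured_bps = afx1_bps

    def accepts_probe(self):
        """ACCEPT state while the measurement is below the threshold."""
        return self.measured_bps < self.threshold_bps


if __name__ == "__main__":
    gate = ThresholdGate(threshold_bps=1_500_000)   # e.g. 75% of a 2 Mbps link
    gate.update_measurement(1_000_000)
    print(gate.accepts_probe())
    gate.update_measurement(1_600_000)
    print(gate.accepts_probe())
```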

It is well recognized that target QoS performance can be obtained by simply 
controlling throughput (i.e., by aggregate measurements taken on accepted traffic). 
This principle is at the basis of the more sophisticated state-of-the-art MBAC 
implementations described in [BJS00, GRO99]. As a simple quantitative example, 
with 32 Kbps peak rate Brady on-off voice calls (see section 4.2) offered to a 2 Mbps 
(20 Mbps) link, a target link utilization of 75% (92%) leads to a 99th percentile per 
hop delay lower than 5 ms (see figure 3, reproduced from [BCP00]). 




Fig. 3. 99th delay percentile versus throughput (horizontal axis: Accepted Load), for 
different EAC schemes (see [BCP00]) and related parameter settings 

Tighter forms of traffic control are possible. As a second example of a decision 
criterion, we demonstrated that hard (loss and/or delay) QoS guarantees can be 
provided, under suitable assumptions on the offered traffic (i.e., traffic sources 
regulated by standard Dual Leaky Buckets, as in the IntServ framework) and with ad 
hoc defined measurement modules in the routers [BB01a]. When hard QoS 
guarantees are aimed at, it is furthermore necessary to solve the problem of 
simultaneous activation of flows. In fact, the Gauge&Gate operation implies that 
simultaneous setup attempts (i.e., probes arriving at the router within a very short time 
frame) may see the router in the ACCEPT state, and thus may lead to concurrent 
activation of several flows, which can in turn overload the router above the promised 
QoS level. Actually, this is a common and recognized problem of any MBAC scheme 
that does not rely on explicit signaling. Although it does not compromise the stability 
of the described operation (overloaded routers close the gate until congestion disappears; 
see the mathematical formalization of such a problem and the computation of the 
"remedy" period in [GRO99]), this can be a critical issue if strict QoS guarantees are 
aimed at. In [BB01a] we have solved this problem by introducing an aggregate stack 
variable, which takes into account “transient” flows, i.e., flows promoted to the 
"accepted" state but not yet emitting information packets. This stack protection scheme 
avoids the concurrent activation of a number of flows that could overload the 
router above the promised QoS level: the price to pay is a slight under-utilization of 
the available link capacity. 
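To make the simultaneous-activation problem concrete, the following sketch shows one possible way an aggregate counter of transient flows could protect the gate. It is an assumption-laden illustration, not the scheme of [BB01a]: the class `StackProtectedGate`, the per-flow bandwidth booked for each transient flow, and the `release_transient` hook (how the router learns that a transient period is over is left outside the sketch) are all hypothetical.

```python
class StackProtectedGate:
    """Gate that books bandwidth for accepted-but-not-yet-active flows."""

    def __init__(self, capacity_bps, per_flow_bps):
        self.capacity_bps = capacity_bps   # target AFx1 load for this port
        self.per_flow_bps = per_flow_bps   # assumed booking per transient flow
        self.measured_bps = 0.0            # measured AFx1 load (active flows)
        self.stack = 0                     # aggregate count of transient flows

    def on_probe(self):
        """Return True (forward the AFx2 probe) or False (drop it)."""
        booked = self.measured_bps + (self.stack + 1) * self.per_flow_bps
        if booked > self.capacity_bps:
            return False
        self.stack += 1                    # the new flow is now "transient"
        return True

    def release_transient(self, new_measured_bps):
        """A transient flow is assumed to show up in the measurements now."""
        self.stack = max(0, self.stack - 1)
        self.measured_bps = new_measured_bps


if __name__ == "__main__":
    gate = StackProtectedGate(capacity_bps=100, per_flow_bps=30)
    # Four probes arrive before any accepted flow starts sending.
    print([gate.on_probe() for _ in range(4)])
```

Without the counter, all four probes in the usage example would see an unloaded router and be accepted; with it, the fourth is dropped because the three transient flows already book most of the target load.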

Another important issue is what happens when traffic flows with widely different 
peak rates are offered to the same link. Several approaches may be adopted. The 
simplest one is to differentiate traffic aggregates into classes of similar traffic 
characterization (e.g., similar peak rate), and associate to each class a different AF 
PHB class x, each with its probes AFx2 and data packets AFx1. A second possibility 
is to multiplex traffic onto the same AF PHB class, but differentiate probing packets by 
using the AFx3 drop level for traffic flows with higher peak rate (as briefly sketched 
in section 3.3). 

Finally, we note that the AFx2 dropping algorithm need not be driven 
by IP-level traffic measurements. In fact, it can be driven by lower-layer QoS 
capabilities (e.g., ATM). 



4.2 Performance of GRIP over RED Implementations of the AF PHB 

In this paper we have demonstrated that, quite surprisingly, admission control can be 
deployed over the standardized Assured Forwarding PHB. In the previous section 4.1, 
we have concluded that an arbitrary degree of performance guarantees can be obtained 
by designing specific AF PHB implementations based on runtime traffic 
measurements. A natural question follows: since Random Early Discarding (RED) queue 
management is customarily considered the “natural” AF PHB implementation², what is 
the performance of GRIP when it is operated over RED queues? 

² We recall that the AF PHB specification [R2597] does not recommend, by any means, a 
specific implementation. Indeed, RED schemes have emerged as the natural AF PHB 
implementations, since they provide improved performance when TCP traffic (i.e., the traffic 
traditionally envisioned for AF) is considered. We now face a very different problem, i.e., what 
happens to performance when widespread RED AF PHB implementations are used for a completely 
different purpose, i.e., to support admission controlled (UDP) traffic. 

Fig. 4. Buffer management scheme 

In our simulation program, we have assumed, for convenience, only two drop 
levels, namely AFx1 and AFx2. We have adopted a single buffer for both AFx1 and 
AFx2 packets. AFx1 packets are dropped only if the buffer is completely full. 
Instead, AFx2 packets are dropped according to a Random Early Discarding 
(RED) management, where the AFx2 dropping probability is a function of the queue 
occupancy, computed accounting for both AFx1 and AFx2 packets³. As shown in 
figure 4, the AFx2 dropping probability versus the queue occupancy is a piecewise 
linear curve: no AFx2 packets are dropped when the number of packets stored in the 
queue is lower than a lower threshold. As the number of packets grows beyond the 
lower threshold, the AFx2 dropping probability increases linearly, until a given value is 
reached at the upper threshold. Beyond this point, the dropping 
probability is either set at 100% (in some RED implementations), or increased 
linearly until the number of packets fills the buffer capacity (see figure 4). As a result, 
a RED implementation of the AFx2 dropping algorithm is completely specified 
by means of 4 parameters: the buffer size, the lower and upper thresholds, and the 
value of the dropping probability reached at the upper threshold. 
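The piecewise-linear AFx2 dropping curve just described can be written down as a short function of the four parameters. This is a sketch under assumptions: the function and parameter names are hypothetical, the queue occupancy is taken as an instantaneous packet count (no smoothing), and the flag `ramp_to_full` selects between the two variants mentioned in the text (jump to 100% beyond the upper threshold, or linear increase up to the buffer size).

```python
def afx2_drop_probability(queue_len, buffer_size, low_th, high_th,
                          p_at_high, ramp_to_full=False):
    """Dropping probability for an arriving AFx2 packet.

    queue_len counts both AFx1 and AFx2 packets stored in the buffer.
    """
    if not (0 <= low_th < high_th < buffer_size):
        raise ValueError("expected 0 <= low_th < high_th < buffer_size")
    if queue_len < low_th:
        return 0.0                      # below the lower threshold: never drop
    if queue_len <= high_th:
        # Linear increase from 0 at low_th to p_at_high at high_th.
        return p_at_high * (queue_len - low_th) / (high_th - low_th)
    if not ramp_to_full:
        return 1.0                      # variant 1: drop everything
    # Variant 2: linear increase from p_at_high to 1 when the buffer is full.
    slope = (1.0 - p_at_high) / (buffer_size - high_th)
    return min(1.0, p_at_high + slope * (queue_len - high_th))


if __name__ == "__main__":
    for q in (10, 40, 60, 80, 100):
        print(q, afx2_drop_probability(q, 100, 20, 60, 0.1, ramp_to_full=True))
```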

Depending on the considered implementation, the AFx buffer occupancy is either 
sampled at the arrival of an AFx2 packet, or suitably smoothed/filtered in order to 
capture the moving average of the AFx queue occupancy. However, in all 
implementations proposed in the literature, an AFx2 packet that finds no packets 
stored in the buffer is always accepted (regardless of the fact that the smoothed AFx 
queue occupancy may give a value different from 0). We will see in what follows 
that, ultimately, this specific condition prevents GRIP from achieving effective 
performance, regardless of the RED parameter settings considered. 

³ The described operations can also be seen as a particular case of a standard WRED 
implementation [CIS], where the lower and upper thresholds for AFx1 packets both coincide 
with the buffer size. 

Performance results have been obtained via simulation of a single network link, 
loaded with offered calls arriving at the link according to a Poisson process. Each 
offered call generates a single probe packet. A sufficiently large probing phase 
timeout has been set to guarantee that calls are accepted when the probing packet is 
not dropped (in the simulation, we have attempted to reproduce conditions as close to 
ideal as possible, so that numerical results are not affected by marginal 
parameter settings, e.g., round trip time, probing phase timer, etc.). Accepted calls 
have been modeled as Brady ON-OFF sources, with peak rate equal to 32 Kbps, and 
ON/OFF periods exponentially distributed with mean value, respectively, 1 second 
for the ON period, and 1.35 seconds for the OFF period (yielding an activity factor 
equal to 0.4255). Each call lasts for an exponentially distributed time, with mean 
value 120 s. The link capacity has been set to 2 Mbps. Therefore, the link is 
temporarily overloaded when more than 146.9 connections (2000/(0.4255×32)) 
are active at a given instant of time. 
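The figures quoted above follow from simple arithmetic on the source model, reproduced here as a short check. The variable names are illustrative only; the last line relates the overload point to the 75% target utilization used in section 4.1.

```python
peak_kbps = 32.0                    # peak rate of a Brady ON-OFF source
mean_on_s, mean_off_s = 1.0, 1.35   # mean ON and OFF durations
link_kbps = 2000.0                  # 2 Mbps link

# Fraction of time a source is ON.
activity = mean_on_s / (mean_on_s + mean_off_s)
# Average rate contributed by one accepted call.
mean_rate_kbps = activity * peak_kbps
# Number of active calls whose average load equals the link capacity.
overload_calls = link_kbps / mean_rate_kbps
# Number of calls corresponding to a 75% target utilization.
calls_at_75 = 0.75 * overload_calls

if __name__ == "__main__":
    print(f"activity factor      : {activity:.4f}")
    print(f"mean rate per call   : {mean_rate_kbps:.2f} Kbps")
    print(f"overload above       : {overload_calls:.1f} calls")
    print(f"75% utilization at   : {calls_at_75:.1f} calls")
```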

Figures 5 to 7 report the number of accepted calls versus the simulation time for 
three different load conditions: underload (normalized offered load equal to 75% of 
the link capacity), slight overload (110% offered load), and harsh overload (400% 
offered load). 



Fig. 5. Active calls vs. simulation time (sec), with 75% offered load; “packs” indicates the 
AFx2 threshold. 

Fig. 6. Active calls vs. simulation time (sec), with 110% offered load; “packs” indicates the 
AFx2 threshold. 

Fig. 7. Active calls vs. simulation time (sec), with 400% offered load; “packs” indicates the 
AFx2 threshold. 

In the above figures, we have considered a very large AFx buffer size, such that 
no AFx1 packet losses occur (thus, QoS of accepted traffic is quantified in terms of 
AFx1 packet delay). We have tried several different RED parameter configurations, 
but, for simplicity of presentation, we report results related to a basic parameter 
setting where a single threshold is considered: all AFx2 packets are accepted 
whenever the number of AFx packets is lower than this threshold, while all AFx2 
packets are dropped when the number of AFx packets is greater than this threshold 
(very similar results are obtained with more complex RED configurations, as will 
be clear from the following discussion). Similarly, for simplicity, no smoothing of the 
buffer occupancy has been performed. 

Figure 5 shows that, in low load conditions, a tight threshold setting (just 1 
packet, i.e., an AFx2 packet is dropped whenever at least 1 AFx packet is stored in the 
buffer) is overly restrictive, and exerts a high and unnecessary blocking probability on 
offered calls. With larger thresholds (200 and 2000 packets), we note from figure 5 
that the number of accepted calls fluctuates in the range 100 to 120, meaning that just 
a small fraction of the offered calls is blocked by the GRIP operation (as would 
ideally occur if a stateful admission control algorithm were operated). 

Much more interesting and meaningful (for our purposes) are the results 
presented in figure 6. Here, we see that, in slight overload conditions, the only RED 
configuration setting that keeps the accepted load lower than the target 75% 
value (i.e., about 110 accepted calls) suggested by figure 3 is the threshold set to just 
1 packet. Even with a very small threshold value, such as 10 packets, we see that the 
average number of accepted calls gets much greater than 110, thus resulting in 
unacceptable delay performance for accepted traffic. 

Throughput and delay performance are quantified in table 8, for 110% and 400% 
offered load conditions. The table reports the AFx1 throughput, as well as the 95th 
and 99th delay percentiles experienced by accepted flows. Confidence intervals 
corresponding to a 95% confidence level are also reported in the table to quantify the 
accurateness of the numerical results. 

AFx2 thresh   Offered load   Throughput        95th delay percentile (ms)   99th delay percentile (ms) 
1 pack        110%           68.64% ± 0.28%    2.3 ± 0.014                  3.3 ± 0.019 
1 pack        400%           90.89% ± 0.08%    80.6 ± 1.2                   175.2 ± 1.3 
10 packs      110%           88.71% ± 0.22%    55.4 ± 0.7                   148.444 ± 3.0 
10 packs      400%           97.72% ± 0.03%    586.3 ± 4.8                  987.2 ± 10.2 
200 packs     110%           93.11% ± 0.28%    235.9 ± 4.0                  433.8 ± 3.5 
200 packs     400%           99.02% ± 0.02%    1282.8 ± 15.9                2023.3 ± 75.1 
2000 packs    110%           96.39% ± 0.24%    1433.9 ± 20.0                1975.6 ± 26.9 
2000 packs    400%           99.77% ± 0.03%    4656.0 ± 24.9                5921.8 ± 58.2 

Fig. 8. Throughput and delay performance. 



From the table, we see that the only case in which we meet target QoS 
performance for IP telephony (i.e., 99th delay percentile of the order of a few ms) is the 
case of threshold set to 1 packet, and light overload. It is quite striking to note that 
the smallest possible RED threshold (i.e., drop an AFx2 packet whenever the AFx 
queue is not strictly empty) does not succeed in guaranteeing QoS in large overload 
conditions. This result allows us to conclude that, regardless of the RED parameter 
configuration, a RED implementation is never capable of guaranteeing QoS in all load 
conditions. As long as a RED implementation always accepts an AFx2 packet when 
the AFx queue is empty⁴, performance will be at best equal to that reported in table 
8, for a threshold equal to 1 packet. 

As figure 7 shows, in high overload conditions, thresholds greater than 1 packet 
cannot even prevent the number of accepted calls from temporarily exceeding 
146.9, i.e., the link from being overloaded. In such a case, load oscillation phenomena 
occur: as clearly shown in figure 7, the link alternates between periods of significant 
overload (in which the AFx buffer fills up), and “remedy” [GRO99] periods, where 
the router locks in the REJECT state, until congestion disappears. The result is that 
the 95th and 99th delay percentiles are of the order of several seconds (see table 8). 

⁴ We recall that this specific rule is common to all RED implementations, i.e., regardless of 
the smoothing and filtering scheme adopted on the number of AFx packets, an AFx2 packet is 
never dropped when an empty AFx buffer is found. Changing this rule means changing the 
intrinsic logic of the RED approach, i.e., moving from queue status measurements to crossing 
traffic measurements. What we have proven here is that such a leap is required in AF 
implementations if QoS guaranteed admission controlled services are to be supported. 

To conclude this section, we observe that, although RED implementations are 
intrinsically incapable of providing performance guarantees, a proper 
parameter setting does achieve reasonably better than best effort performance. 
For example, with a threshold set to 10 packets, table 8 shows that, in very high 
(unrealistic) load conditions, the 99th delay percentile is still lower than 1 second (although 
the link has already been congested, as proven by the link load fluctuations, of the order 
of 15% of the link capacity, and leading to temporary link overload, as shown in 
figure 7). Notably, with the same 10 packets threshold, the 99th delay percentile 
drops down to less than 150 ms when light overload conditions are considered. Such a 
degree of QoS support might be considered sufficient in a short-term perspective, 
where current AF implementations might be utilized to support admission control. 



5 Conclusions 

In this paper, we have shown that the standard DiffServ AF PHB is semantically 
capable of supporting stateless and scalable admission control. The driving idea is that 
accept/reject decisions are taken at the edge of the network on the basis of probing 
packet losses, probes being tagged with a different AF level than packets generated by 
accepted traffic. In turn, these losses are driven by the dropping algorithm adopted in 
the specific AF implementation running at each network router. It is important to 
understand that, following the spirit of DiffServ, the above described operation does 
not aim at providing a quantified level of assured QoS, but, in conformance with PHB 
and PDB specifications, it provides a reference framework over which quantitative 
performance specifications may be deployed. For the purposes of this paper, it was in 
our opinion sufficient to show that the described operation is compliant with the 
specification of the AF PHB. 

The key to QoS guarantees is left to each specific implementation, i.e., is left to 
the implementation-dependent quantification of the notion of congestion, which 
triggers the gate mechanism for AFx2 packet discarding. Each administrative entity is 
in charge of determining the optimal throughput/delay operational point it 
wants to support. 

The AF PHB implementations so far considered in the literature, based on (RED) 
thresholds set on the queue occupancy, do not affect the described endpoint operation 
and thus may be considered to support admission controlled traffic. However, an 
additional important contribution of this paper was to prove that RED implementations 
are incapable of achieving tight QoS support, regardless of their parameter settings. 
In fact, the “measurement” mechanism adopted is overly simple, and this translates 
into poor QoS support, though still much better than best effort, since admission control 
is still enforced on connections being set up. Indeed, this guarantees that the described 
GRIP operation can be supported over already deployed AF routers to seamlessly 
improve QoS with no modification of internal routers; this could be sufficient in a 
short-term perspective, but deeper enhancements are required in order to satisfy coming 
needs in a long-term perspective. 



Abstract. Providing Quality of Service in a multimedia context in the next generation 
Internet poses a number of open challenges, especially regarding routing and 
resource management strategies. In fact, the massive amount of data that is envisaged 
to be transported in the near future (especially in a Differentiated Services 
framework) will make it difficult to have networks that are deeply rearrangeable 
in real time. A more realistic framework will most likely see a joint action of 
low-level, fast reactive, and high-level, slowly reactive, tuning operations. Within 
this trend, we introduce and analyze a system in which bandwidth assignment 
and routing are readjusted within a layered architecture. Low-level local tuning 
procedures are fast reactive and operate on small parts of the network; high-level 
control actions act on a much longer time scale and affect the whole network: they 
are able to perform global reallocations in order to cope with heavy network traffic 
variations. Numerical simulation results show that the use of the proposed 
mechanism positively affects the overall network performance. 



1 Introduction 

The Internet is shifting from a mere connectionless best-effort paradigm to a network 
able to guarantee (at least) some minimum set of Quality of Service (QoS) for a wider 
range of traffic types (see, e.g., [?]). A notable “milestone” of such a transition has been 
the introduction of the new IP protocol (i.e., IPv6 [?]), which makes it possible to identify 
users’ flows while still maintaining IP’s connectionless paradigm. Furthermore, a weak form 
of resource reservation has been eased by the definition of reservation protocols and 
frameworks such as the Integrated Services [?] (IntServ) and Differentiated Services [?] 
(DiffServ) ones. Moreover, in this transition context, Multi Protocol Label Switching 
(MPLS) [?] represents a very powerful, efficient and simple solution for the actual 
implementation of a large number of routing strategies. 

The IntServ framework acts on a per-connection basis by running a reservation 
protocol [6] at the beginning of each session in order to check for resource availability. 
It has been widely accepted, however, that this solution is not scalable and, as such, is 
not applicable on a global scale. Scalability is, instead, addressed by the DiffServ 
solution, which defines a set of “high-level pipes” satisfying a given set of QoS constraints: 
the network only copes with a limited number of service classes. IntServ and DiffServ 
are envisaged to cooperate in the access and in the core part of the network, respectively. 

S. Palazzo (Ed.): IWDC 2001, LNCS 2170, pp. 251 ff., 2001. 

In a DiffServ domain, the services offered by the network provider are defined in 
the so-called Service Level Agreement (SLA) [?]. The SLA can be either static (the 
most commonly used type today) or dynamic. In the latter case, the network has to be 
prepared to react to (a set of predefined) changes in input flow parameters, as well as in 
the required QoS, without human intervention. Thus it requires an automated agent and 
protocol (for example, a bandwidth broker [?]) to represent the differentiated service 
provider’s domain. It is worth noting that the SLAs in DiffServ apply to aggregates of 
traffic and not to individual flows. Hence, while the IntServ mechanism acts on network 
resources on a connection-level time scale, the DiffServ one (for scalability reasons) 
should intervene at a slower pace in tuning its resources. 

The “boundary device” between IntServ and DiffServ frameworks is the so-called 
DiffServ edge router. A possible role of such a device is to map IntServ flows (from the 
access network) into pre-defined DiffServ pipes (in the core network) and to re-create 
the IntServ flows (to be delivered through the access network) from DiffServ aggregates. 
Merging and splitting flows is well supported by MPLS. 

We assume a model in which, when there are too many incoming user flows, DiffServ 
edge routers issue a request for resource upgrade to the DiffServ bandwidth broker, along 
a given source-destination route. For the sake of simplicity, such requests come in the 
form of a fixed amount of bandwidth, a sort of “bandwidth quark”. This paradigm makes it 
possible to reuse telephone-like (or ATM-like) resource control solutions. 

In order to react quickly, the bandwidth broker reserves in advance a given 
amount of bandwidth for each source-destination route in the core network (sources 
and destinations are, in this framework, edge routers). It then checks whether the pre-reserved 
bandwidth is large enough to accommodate the request along such a route. If there is no 
room, an attempt is made through an alternate route. In case of success the request is 
accepted and buffers and scheduling controllers are re-configured properly. Otherwise, 
the request is rejected. 
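The accept/reject procedure just described can be sketched as follows. This is a minimal illustration and not the system analyzed in the paper: the class `BandwidthBroker`, its route labels, and the use of an ordered list of alternate routes (the text mentions a single alternate route) are assumptions, and the re-configuration of buffers and schedulers is omitted.

```python
class BandwidthBroker:
    """Tracks pre-reserved bandwidth, in fixed-size units, per route."""

    def __init__(self, reserved_units):
        # reserved_units: dict mapping a route label to its free units.
        self.free = dict(reserved_units)

    def request(self, primary, alternates=(), units=1):
        """Try the primary route, then the alternates; return the route used."""
        for route in (primary, *alternates):
            if self.free.get(route, 0) >= units:
                self.free[route] -= units   # accept: book the bandwidth
                return route
        return None                         # reject: no room on any route

    def release(self, route, units=1):
        """Give back bandwidth when the requesting flow departs."""
        self.free[route] = self.free.get(route, 0) + units


if __name__ == "__main__":
    broker = BandwidthBroker({"A-B": 1, "A-C-B": 1})
    print(broker.request("A-B", alternates=["A-C-B"]))
    print(broker.request("A-B", alternates=["A-C-B"]))
    print(broker.request("A-B", alternates=["A-C-B"]))
```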

Hence, in this context, an incoming flow is to be understood as a request to add a given 
amount of traffic to a DiffServ pipe, i.e., from a given source to a given destination in 
the DiffServ network. As such, these bandwidth requests may come and go according 
to some arrival and departure distribution. We may envisage the edge router acting 
as a filter: when too many IntServ user requests are blocked, the edge router issues a 
“flow” request to the system. Of course, depending on resource availability in the core 
(DiffServ) network, such edge router requests might also be accepted or blocked. 

The bandwidth broker has two main alternative approaches for supporting the Diff- 
Serv classes of service 1^]: i) complete statistical multiplexing, where the resources 
(buffers and bandwidth) are shared among different services; ii) limited statistical multi- 
plexing, where services with possibly widely different performance requirements and/or 
statistical characteristics of the traffic sources are assigned separate resources, in various
combinations (service and/or route separation). One advantage of the first approach
may be to reduce the number of blocked flows; a main limit is, to some degree, its an- 
alytical complexity. On the other hand, a service and path separation approach has the 
main advantage of being simple and leading to manageable and controllable models; 
a disadvantage may be some under-utilization of bandwidth, which, however, may be 
mitigated by adaptive allocation mechanisms. 

In this work, we investigate an adaptive service separation scheme, the aim of which
is to keep the advantage of simplicity and to enhance the bandwidth usage by introducing
adaptability in DiffServ resource allocation. Furthermore, such an allocation scheme,
together with an appropriate joint routing strategy, allows the network to be robust against
different types of failures.

The strategy adopted is along a similar line as the one initially outlined in 0. The 
robustness of the proposed policy lies in its capability of responding to possible traffic 
changes in real-time; a part of the total capacity of any physical link is kept in a bandwidth 
pool and used only in case of necessity: if blocking on a DiffServ pipe exceeds some 
threshold, a certain amount of bandwidth can be moved from the bandwidth pool to 
enlarge the pipe size. On the other hand, if a pipe is under-utilized, part of its bandwidth 
can be given back to the pool. Reconfiguration is triggered by predefined blocking 
probability thresholds. 

However, this mechanism only acts on a local basis and, as such, it could lead, in 
the long term, to a poor bandwidth distribution in the whole network. As an example, 
consider two pipes sharing some physical link: if the first one gets over utilized, the pool 
mechanism will try to enlarge it, with the possible side effect of leaving no resources for 
later enlargements of the second one. 

Hence, periodically (but at a lower pace), a second mechanism is triggered, in order 
to perform a global bandwidth redistribution among all pipes on a “global information” 
base. 

The global reallocation mechanism we have adopted tries to distribute bandwidth
evenly. A further possibility, not included in this work, would be to allow for priorities
(e.g., price-based (TQl) among the various DiffServ pipes.

The paper is organized as follows: in Section 2 the control system model, the routing
framework, the route selection, and the bandwidth reallocation mechanism are introduced;
Section 3 describes some simulation results for performance evaluation, and
Section 4 contains the conclusions.



2 The Control System Model 

The implementation, configuration, operation and administration of the nodes of a Diff- 
Serv Domain should effectively partition the resources of those nodes and the inter-node 
links between behavior aggregates (i.e., traffic classes) in accordance with the domain’s
service provisioning policy. 

Hence, similarly to what happens with ATM networks [19] (with Virtual Path Connections),
a DiffServ pipe may be seen as a concatenation of one or more of such partitions
and identifies a direct (virtual) path between a given source-destination (SD) pair. A 
direct path passes through one or more (pre-configured) network routers. The definition 
of the direct path is left to the bandwidth broker that identifies it by optimizing resource 
sharing within the network, subject to QoS constraints. 

Overloaded flows can be “tunneled” (in the MPLS way) to the given destination 
through other (less overloaded) nodes. This operation is resource consuming (i.e., it 
does not use the optimal route) and, as such, in our case, we restrict such alternative 
paths to only those composed of two hops. 

A logical view of the core network is hence built on top of the physical structure;
in a service separation framework, the number of such layers is proportional to the
number of traffic services. DiffServ pipes can be set up by taking into account a
number of constraints that cope with network robustness and efficiency. For example,
avoiding having more than one pipe, for each particular SD pair, sharing the same
physical links would increase network robustness, while having direct pipes between
the most requested SD pairs would increase network efficiency. In our test-bed case, for
the sake of simplicity, every SD pair is connected by a direct path and, to the extent
allowed by the physical topology, by a number of alternative paths. An in-depth analysis
of direct path initialization (and restoration) is beyond the scope of this work: for a more
detailed description of this topic see, e.g., I1 111211511 411 .51 . 

DiffServ pipes are assigned bandwidth following the outcomes of a global high-level 
reallocation control policy (embedded in the bandwidth broker) that, based on the traffic 
matrix, computes the amount of bandwidth needed by the various pipes in order to keep 
their blocking probability below a given threshold (the details of the reallocation control
algorithm are given below).

After such an initial set-up phase, each link may have an amount of spare bandwidth left.
This remaining bandwidth is treated link by link as a “dummy” pipe, named “bandwidth
pool”. This bandwidth pool is useful for two purposes: the first one is to enhance the
performance of the service separation policy (as we will see), and the second one is to
give the network the necessary robustness in the face of failures (e.g., congestion, physical
failure). More specifically, the “bandwidth pool” may be used as follows. Whenever
a portion of the bandwidth of a given pipe is under-utilized, it will be added to the
pools of all the physical links involved: the remaining bandwidth will then be just the
amount necessary to support the required QoS. Thus, if for a long period of time a pipe is
only partially used, or not used at all, then its capacity will be gradually decreased and
put into the pool. On the other hand, if a certain pipe is highly used and the probability
of flow overload increases above a given threshold (or some other triggering event
takes place), then, in real time, bandwidth can be taken from the physical link pools to
enlarge the overloaded pipe. Figure 1 shows an example of this last case. The above
scheme also solves the problem of the possible waste of bandwidth that may appear
immediately after the initial assignment (i.e., when the traffic matrix is not precisely known),
since whatever initial amount of bandwidth is given to a certain pipe will be adaptively
adjusted over time, depending upon the actual utilization of that path.

Another aspect regards the choice of the amount of bandwidth to be shifted between 
the pool and a pipe, which can be made according to various criteria m-, in the simula- 
tions reported below, we have chosen to shift the amount corresponding to a single flow. 
The “optimal” choice of the bandwidth quantum is currently a matter of investigation.

As said before, this pool (low-level) reallocation mechanism (also embedded in the
bandwidth broker) enlarges or shrinks single pipes whenever needed without taking
into account the global information (i.e., the traffic matrix): the first pipe that needs
bandwidth can get it at the expense of the others. Hence, a second (high-level) control
mechanism is needed in order to periodically and fairly reallocate resources among all pipes.
Such a mechanism can be triggered in a number of ways. For example, it may start when
the (Euclidean) distance between the currently estimated traffic matrix and the one
used in the previous reallocation is too large; it could also be activated when the pool
interventions are too many. If traffic changes very slowly with respect to flow dynamics,
such a mechanism could be run periodically every given amount of time.

Fig. 1. An example of pool utilization: (a) initial assignment; (b) applied assignment.

2.1 The Routing Framework 

In the proposed system, routing and bandwidth control are distributed among different
levels that, therefore, can be assigned quite simple tasks. In particular, this allows the
complexity of the routing mechanism to be scaled down to the following classical simple
and fast scheme: whenever a new flow comes up, try to use the direct path. If the direct
path has no room, then choose the “best” alternative two-hop route (see, e.g., IH for a
number of different solutions in the choice of the alternative path); if there is not enough
room over any of the alternative paths, then reject the flow.
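
As an illustration only, the routing scheme just described can be sketched in a few lines of Python. The `Pipe` class, the `route_flow` function and the rule used to rank two-hop routes (largest bottleneck residual capacity) are assumptions made for this sketch; the text leaves the notion of “best” alternative open, and trunk reservation is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Pipe:
    capacity: int      # maximum number of flows the pipe can carry
    in_use: int = 0    # flows currently in progress

    def free(self) -> int:
        return self.capacity - self.in_use


def route_flow(pipes, src, dst, nodes):
    """Return the list of pipes used by a new flow, or None if it is rejected."""
    direct = pipes[(src, dst)]
    if direct.free() >= 1:                 # first choice: the direct path
        direct.in_use += 1
        return [(src, dst)]
    # Candidate two-hop routes src -> k -> dst with room on both pipes.
    candidates = []
    for k in nodes:
        if k in (src, dst):
            continue
        first, second = pipes.get((src, k)), pipes.get((k, dst))
        if first is None or second is None:
            continue
        if first.free() >= 1 and second.free() >= 1:
            candidates.append((min(first.free(), second.free()), k))
    if not candidates:
        return None                        # no room anywhere: reject the flow
    _, k = max(candidates)                 # assumed ranking: largest bottleneck residual
    pipes[(src, k)].in_use += 1
    pipes[(k, dst)].in_use += 1
    return [(src, k), (k, dst)]
```

Note that an accepted two-hop flow increments the occupancy of two pipes, which reflects the doubled resource consumption of alternative routes.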

Figure 2 shows the overall routing strategy. The scheme aims at meeting both performance
and robustness requirements. In fact, the pool guarantees a higher level of
control to better satisfy the required flow-level QoS by keeping the rejection rate low.
Furthermore, this layered admission control mechanism (i.e., direct, alternative, pool)
increases the overall robustness of the system. It is worth noting that routing a flow
through an alternative path consumes twice the resources that would have been needed
in normal conditions, i.e., for a direct path (in effect, even more than twice if more than
two hops are allowed) and, hence, the load imposed on the network by that flow is doubled.
A well-known risk in this framework is to have, in the long run, all (or the great majority of)
flows set up through alternative paths, and this would dramatically decrease the network
performance. A well-known solution to this problem (also adopted in this work) is to
reserve a minimum bandwidth in every pipe for direct flows only (trunk reservation; see,
e.g., inii).

Fig. 2. Pictorial scheme of the overall multi-layer control system (Level 1 to Level 3).



2.2 Alternate Route Selection 

A route optimization mechanism may be employed in the choice of an alternative two-hop
path. Generally speaking, three kinds of routing policies may be identified: static,
dynamic (alternate), and adaptive routing; in all routing strategies, some technique should
be used to choose the preferred route. One such technique is load sharing [17]. In the
case of alternate routing, load sharing could be used not only for its simplicity, but also
because it achieves some level of fairness. The static routing policy is the simplest one
and also the one for which the largest collection of theoretical results is available. In its
simplest form, the traffic stream between a source-destination pair is partitioned among
a set of predefined paths. When an incoming flow comes up, one of the predefined paths
(e.g., the one with the largest residual capacity) is selected: if it is busy, then the flow is
rejected. Alternate routing is quite similar to the previous one but, this time, the set
of the alternative paths is ordered and the system looks for the first available one among
them. The adaptive strategy does not order the list of alternative paths but, in real time, it
chooses the “best” one according to some criterion. In our case, as regards the second step
of the routing mechanism (the choice of the alternative path), we adopt the third scheme
with the load sharing technique, because it is tractable and meets the requirement of
dividing the load over a subset of alternative paths. Since this is just a part of a more
complex routing scheme and it is not adopted alone, its simplicity represents a point of
strength.

However, selecting any path should not violate the QoS requirements, so that a QoS
handler algorithm must in any case be applied along the traversed path. We suppose in the
following that such an algorithm is based on the knowledge of the maximum number
$N_{r,\max}^{h,ij}$ of aggregated flows acceptable over pipe $r$ between nodes $i$ and $j$ to ensure class-$h$
flows the appropriate QoS requirements. Moreover, the following inequality constraint is
applied to determine the bandwidth assignment to a certain pipe at any moment, bearing
in mind that the capacity of any pipe might be varied, over time, to comply with the
traffic intensities, using the bandwidth pool strategy:

$$V_r^{h,ij} \ge V_{r,\min}^{h,ij} \tag{1}$$

where $V_r^{h,ij}$ is the total amount of capacity of the considered pipe $r$, for service class $h$,
and $V_{r,\min}^{h,ij}$ is the minimum capacity that is necessary to respect the QoS requirements
for the flows in progress. If the new incoming flow cannot be accommodated over
one of the direct paths that connect the same SD pair, then an alternative path should be
selected. The process of selecting an alternative path is activated if

$$N_r^{h,ij} + 1 > N_{r,\max}^{h,ij} \tag{2}$$

where $N_r^{h,ij}$ is the number of class-$h$ aggregated flows in progress over pipe $r$ between
nodes $i$ and $j$.

In performing the load sharing mechanism, the offered load matrix to the network
(i.e., external incoming traffic intensities for each SD pair) is supposed to be known.
Given also a set of alternative paths, the optimization algorithm identifies the share of
flow for each alternative path. For example, suppose that for a given source-destination
pair and for a given service, the offered load to the alternative paths is known to be, say,
the equivalent of 30 Erlangs and suppose that 3 different alternative paths (besides the
direct one) are available for that service. Suppose also that the algorithm computes a
value for each of them of, say, 10, 5 and 15 Erlangs, respectively: hence 10/30 of the
total incoming flows will be routed over the first alternative path, 5/30 will be routed
over the second one and 15/30 over the third one. This could be done either with a
weighted round-robin mechanism or by employing some random mechanism with an
appropriate distribution.

Notice that the offered load we consider here represents, in
effect, the amount of traffic which exceeds the one sent through the direct paths (i.e., the
“overflow” traffic), and that we use the same pipes to carry both the direct and the two-hop
traffic. We model the flow interarrival and duration statistics with negative exponential
distributions. The traffic matrix will contain the offered load values (in Erlangs) for each
service and for each source-destination pair.

For the sake of notational simplicity, let
us concentrate, for the time being, on one type of service (and one pipe for each SD
pair), since all that follows can be separately applied to each service class. Let us call
$A$ the external input flow matrix, $A^{ij}$ the element which represents the external offered
load from node $i$ to node $j$, and let $O^{ij}$ represent the overflow traffic on the direct path
(direct pipe) $ij$. Let us, furthermore, suppose to have $K$ alternative two-hop paths and
let us call $\alpha_k^{ij}$ (with $\alpha_k^{ij} \ge 0$ and $\sum_{k=1}^{K} \alpha_k^{ij} = 1$) the share of the overflow traffic
routed through node $k$, $k = 1, \ldots, K$. We denote with $B_k^{ij}$ the blocking probability
over the entire path through $k$ between switching nodes $i$ and $j$. By approximating the
pipe blocking probabilities as independent, we will then have:

$$B_k^{ij} = 1 - \left(1 - B^{ik}\right)\left(1 - B^{kj}\right) \tag{3}$$

where $B^{ik}$ is the blocking probability over the virtual link $ik$. Notice that an
alternative path is composed of two pipes, hence $ik$ denotes the first and $kj$ the second
pipe involved. The $B^{ik}$’s are the solutions of the Erlang Fixed Point Equation, which,
by using the Erlang B function, involves the offered traffic to pipe $ik$, and the maximum
number of virtual circuits which can be set up on it.

The Erlang Fixed Point Equation, used to find the link blocking probabilities $B^{ik}$,
can be written as:

$$B^{ik} = \frac{\left(Q^{ik}\right)^{N_{\max}^{ik}} / N_{\max}^{ik}!}{\sum_{n=0}^{N_{\max}^{ik}} \left(Q^{ik}\right)^{n} / n!} \tag{4}$$

where $Q^{ik}$ and $N_{\max}^{ik}$ represent the total offered load and the maximum number of
aggregated flows that can be carried over the pipe $ik$ (see EJ) and, by using the so-called
“reduced load approximation” JHE 2 I, the total offered load to pipe $ik$ can be expressed
as

$$Q^{ik} = A^{ik} + \sum_{p \in P(i)} O_i^{pk} \left(1 - B^{pi}\right) + \sum_{s \in S(k)} O_k^{is} \left(1 - B^{ks}\right) \tag{5}$$

In (5), $P(i)$ and $S(k)$ represent the sets of predecessors of node $i$ and successors of node $k$,
respectively, along a two-hop path; moreover, we should keep in mind that the overflow
traffic shares are represented by

$$O_k^{ij} = \alpha_k^{ij} \, O^{ij} \tag{6}$$
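
To make the structure of (3)-(6) more concrete, the following Python sketch iterates the fixed point by repeated substitution. It is only an illustration under stated assumptions: the function names, the dictionary-based data layout and the plain successive-substitution scheme (with no damping) are choices made here, not taken from the paper; the Erlang B value is computed with the standard recursion.

```python
def erlang_b(load, servers):
    """Erlang B blocking probability, computed with the standard recursion."""
    b = 1.0
    for n in range(1, servers + 1):
        b = load * b / (n + load * b)
    return b


def fixed_point(A, O, alpha, n_max, nodes, iters=200, tol=1e-9):
    """Iterate B <- ErlangB(Q(B), n_max) until the pipe blockings settle.

    A[(i, j)]        external offered load of pair ij (Erlangs)
    O[(i, j)]        overflow traffic of pair ij offered to two-hop paths
    alpha[(i, j, k)] share of O[(i, j)] routed through node k
    n_max[(i, k)]    maximum number of flows on pipe ik
    """
    B = {pipe: 0.0 for pipe in n_max}
    for _ in range(iters):
        new_B = {}
        for (i, k) in n_max:
            q = A.get((i, k), 0.0)
            # Pipe ik used as second hop of a path p -> i -> k.
            for p in nodes:
                if p not in (i, k) and (p, i) in n_max:
                    q += alpha.get((p, k, i), 0.0) * O.get((p, k), 0.0) * (1 - B[(p, i)])
            # Pipe ik used as first hop of a path i -> k -> s.
            for s in nodes:
                if s not in (i, k) and (k, s) in n_max:
                    q += alpha.get((i, s, k), 0.0) * O.get((i, s), 0.0) * (1 - B[(k, s)])
            new_B[(i, k)] = erlang_b(q, n_max[(i, k)])
        delta = max(abs(new_B[p] - B[p]) for p in B)
        B = new_B
        if delta < tol:
            break
    return B
```

With no overflow traffic the iteration reduces to one Erlang B evaluation per pipe; adding overflow through a node raises the blocking of the two pipes that form the corresponding two-hop path.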



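We can now write the overall blocking probability $B_{ij}$ for the $ij$ external traffic:

$$B_{ij} = 1 - \frac{1}{A^{ij}} \left[ A^{ij} \left(1 - B^{ij}\right) + \sum_{k=1}^{K} O_k^{ij} \left(1 - B_k^{ij}\right) \right] \tag{7}$$

The optimization task takes place by minimizing, with respect to the shares $\alpha_k^{ij}$, the
overall average blocking probability:

$$\bar{B} = \sum_{i,j} \frac{A^{ij}}{\sum_{m,n} A^{mn}} \, B_{ij} \tag{8}$$

subject to:

$$\alpha_k^{ij} \ge 0, \qquad \sum_{k=1}^{K} \alpha_k^{ij} = 1 \tag{9}$$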



When the offered load is given, the minimization algorithm yields, for each SD pair $i,j$, the
sharing values $\alpha_k^{ij}$ for each alternative path $ikj$ ($k = 1, \ldots, K$). Since we only consider
two-hop paths, the maximum number $K$ of possible alternative paths for each SD pair
is equal to the number of nodes minus two (i.e., excluding source and destination). This
greatly decreases the computational effort of the optimization task. The latter has to be
performed on-line whenever either a capacity reconfiguration takes place, or the values
of the external offered load $A^{ij}$ change significantly over time.
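
Once the shares are available, the load sharing step itself is simple. The sketch below shows the random variant mentioned earlier in this section, using the 10, 5 and 15 Erlang example; the function name `pick_alternative` and the path labels are hypothetical, and a weighted round-robin would serve equally well.

```python
import random


def pick_alternative(shares, rng=random):
    """Pick one alternative path with probability proportional to its share."""
    paths = list(shares)
    weights = [shares[p] for p in paths]
    return rng.choices(paths, weights=weights, k=1)[0]


# Shares from the 30-Erlang example: 10, 5 and 15 Erlangs on three two-hop paths.
example_shares = {"via_k1": 10.0, "via_k2": 5.0, "via_k3": 15.0}
chosen = pick_alternative(example_shares, random.Random(1))
```

Over many overflow flows, the fractions routed on the three paths approach 10/30, 5/30 and 15/30.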



A Bandwidth Broker Assignment Scheme in DiffServ Networks 



2.3 Bandwidth Reallocation 

As said in the previous sections, the share of bandwidth held by each direct pipe is
periodically updated by the two-level reallocation mechanism. In particular, the low-level
(pool) mechanism looks at the estimated blocking probabilities computed in the
previous reallocation period while, at a much lower pace, the high-level control algorithm
performs a global bandwidth reassignment by implementing some general policy (e.g.,
fairness) over all pipes.

The low-level control mechanism, having the purpose of a fast short-term reallocation
and working on a local network scope, does not recompute the values of the best alternate
routing parameters $\alpha_k^{ij}$. Such parameters are, instead, recomputed upon reallocation of
bandwidth by the high-level control algorithm.



The low-level reallocation control mechanism. As regards the low-level (pool) control
reallocation mechanism, the general framework is to enlarge (if possible) a direct path
if its blocking probability exceeds a given threshold and shrink it (if possible) if its
blocking probability is below a second threshold. The possibility to enlarge is subject
to bandwidth availability in all bandwidth pools along the DiffServ pipe, while the
possibility to shrink is subject to a lower bound on the bandwidth that is necessary to
carry the flows in progress: obviously no shrink is performed if the pipe is full.

There are (at least) five “levels of freedom” in performing the aforementioned reallocation
mechanism. The first is the choice of the time instants in which the reallocation
procedure is to be considered; the second and the third are represented by the enlarging
and shrinking threshold values; the fourth and the fifth are the amounts of bandwidth that
are subtracted from or added to the pipe.

In this work, we have chosen to measure the average blocking probabilities over a
window of L events (i.e., births and deaths). Hence, every L events a (pool) reallocation
might take place (based on the new estimated values), either enlarging or shrinking all
direct pipes that are “out of threshold”.

As regards the threshold values, in this work we have kept them fixed, even
if we are investigating other possibilities. More specifically, since using fixed values
means acting in an “open loop” way (in enlarging or shrinking the pipe, its current
capacity is not taken into account), some other mechanism able to keep track of the pipe
status would be desirable. A trivial example we have tried is represented by having the
thresholds proportional to the pipe capacity. However, special care has to be taken in
order to have the threshold variation dynamics properly chosen, which is a non-trivial
task: in this field the use of fuzzy controllers might be appropriate and is the matter of
current investigation.

Finally, the amount of bandwidth that is subtracted from or added to the pipes has also been
chosen fixed and equivalent to one flow. Other schemes are currently under investigation
and, as such, they are not reported here. The general assumption, however, is that the
reallocation process is able to cope with load dynamics: we suppose that the load over the
pipes changes with a rate that allows the reallocation mechanism to update the bandwidth
properly.
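
A minimal sketch of a single low-level decision, under the choices stated above (fixed thresholds, a quantum of one flow), might look as follows. The function `pool_step` and the dictionary layout are hypothetical; the default thresholds 0.1 and 0.01 are the values used later in the simulations, and the blocking estimate is assumed to come from the last window of L events.

```python
def pool_step(pipe, pools, blocking, enlarge_th=0.1, shrink_th=0.01, quantum=1):
    """Apply one low-level (pool) reallocation decision to a single pipe.

    pipe:     dict with 'capacity', 'in_use' and 'links' (physical links traversed)
    pools:    dict mapping each physical link to its spare (pool) bandwidth
    blocking: blocking probability estimated over the last window of L events
    """
    if blocking > enlarge_th:
        # Enlarging needs spare bandwidth in the pool of every link of the pipe.
        if all(pools[link] >= quantum for link in pipe["links"]):
            for link in pipe["links"]:
                pools[link] -= quantum
            pipe["capacity"] += quantum
            return "enlarged"
    elif blocking < shrink_th:
        # Shrinking must leave enough bandwidth for the flows in progress.
        if pipe["capacity"] - quantum >= pipe["in_use"]:
            for link in pipe["links"]:
                pools[link] += quantum
            pipe["capacity"] -= quantum
            return "shrunk"
    return "unchanged"
```

Because the capacity check uses the flows in progress, a full pipe is never shrunk, as required above.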

The high-level reallocation control mechanism. The high-level reallocation control
mechanism tries to give enough bandwidth to the DiffServ pipes so as to keep the blocking
probability given in (7), introduced in Section 2.2, below a given threshold; that is, for
all source-destination pairs s-d, the algorithm tries to set $B_{sd} < B_{th}$.

The skeleton of the algorithm works by iterating the following steps: 

1. assign 0 bandwidth to all pipes.

2. pick the pipe $\mathrm{pipe}_{sd}$, from source node s to destination node d, with the highest blocking
probability.

3. if $B_{sd} > B_{th}$ then assign a unit of bandwidth - if available - to $\mathrm{pipe}_{sd}$.

4. go to 2.

The algorithm ends either when all pipes have a blocking probability below the
threshold or when it is not possible to further enlarge any pipe for lack of available
bandwidth. At the end of the iterations, the unassigned bandwidth - if any - remains
available for the pool. It might happen that the traffic matrix and the link capacities make
the algorithm end up in a situation where parts of the network have spare bandwidth,
given to the pool, and other parts do not have sufficient bandwidth.
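
The skeleton above can be sketched as follows. This is a simplified illustration, not the authors' implementation: each pipe is treated as an isolated Erlang loss system instead of using the network-wide blocking probability, one bandwidth unit is assumed to carry one flow, and pipes that can no longer grow are skipped so that the loop terminates. All names are hypothetical.

```python
def erlang_b(load, servers):
    """Erlang B blocking probability, computed with the standard recursion."""
    b = 1.0
    for n in range(1, servers + 1):
        b = load * b / (n + load * b)
    return b


def high_level_allocate(loads, routes, link_capacity, b_th=0.1, unit=1):
    """Greedy allocation: feed the worst pipe until all are below b_th or stuck.

    loads:         pipe -> offered load in Erlangs
    routes:        pipe -> list of physical links traversed by the pipe
    link_capacity: link -> capacity in bandwidth units (one unit = one flow)
    """
    alloc = {p: 0 for p in loads}                 # step 1: start from zero
    spare = dict(link_capacity)
    while True:
        # Pipes above threshold that can still grow on every link they cross.
        needy = [p for p in loads
                 if loads[p] > 0
                 and erlang_b(loads[p], alloc[p]) > b_th
                 and all(spare[link] >= unit for link in routes[p])]
        if not needy:
            return alloc, spare                   # leftover bandwidth feeds the pools
        worst = max(needy, key=lambda p: erlang_b(loads[p], alloc[p]))  # step 2
        for link in routes[worst]:                # step 3: one more unit
            spare[link] -= unit
        alloc[worst] += unit
```

Under these simplifications, a pipe offered 2.947 Erlangs with a threshold of 0.1 receives 6 units, since Erlang B gives about 0.106 with 5 units and about 0.049 with 6.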

At the end of this reallocation phase, the optimal values of the best alternate routing
parameters $\alpha_k^{ij}$ are recomputed by means of the minimization in (8), as introduced in
Section 2.2.

As regards the high-level control reallocation mechanism, the choice of the appropriate
triggering policy heavily depends on the type of traffic that characterizes a given
network and, as such, is left to the network administrator. A simple one is to run it every
given (fixed) number of incoming flows. A more efficient one waits for a given (fixed)
number of low-level reallocation interventions. A further one is to trigger it whenever the
“distance” (in the Euclidean sense) between the latest and the present allocation matrix
exceeds a given threshold. Estimated traffic matrix changes might be included in the
triggering mechanism. In this work, we have chosen to set the high-level reallocation
instants every R events (i.e., births and deaths).
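
For the distance-based trigger mentioned above (not the one used in this work), a possible check is sketched below. The function name and the representation of matrices as lists of rows are assumptions made for the illustration.

```python
import math


def should_reallocate(previous, current, threshold):
    """True when the Euclidean distance between two matrices exceeds threshold."""
    dist = math.sqrt(sum((c - p) ** 2
                         for row_p, row_c in zip(previous, current)
                         for p, c in zip(row_p, row_c)))
    return dist > threshold
```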

3 Simulation Results 

A number of simulation results, obtained over a simple test network, are reported and
discussed in this section. The aim is to investigate the performance of the global strategy
outlined above.

In our simulations we considered a physical network having a topology as in Figure 3,
where each link has a total capacity of 150 Mbit/s. Over this, we have created the virtual
network by iterating a backtracking algorithm, which exhaustively sets up the direct
pipes, by attempting to spread them evenly over the physical links. As said in Section 2,
the procedure is a heuristic one, and it is not among our goals to investigate this problem,
which is in itself quite challenging.

The values of the enlarging and shrinking thresholds for the activation of the pool 
are 0.1 and 0.01 respectively. 

Fig. 3. The physical node topology of the test-bed.

For robustness purposes, alternative paths for all the source-destination pairs should
be chosen among all the “two-hop” alternatives that do not share physical links (if possible).
However, since our aim is mainly to test the control strategy previously outlined, we
do not consider this point here, and we allow all possible two-hop paths in our example
network. As regards the traffic, for the sake of simplicity, we considered a single type,
in which each flow has an equivalent bandwidth of 1 Mbit/s (under service separation,
the extension to multi-class traffic is rather straightforward), generated according to a
Poisson model, and having negative exponential duration. The arrival parameters (in
Erlangs) are given in the traffic matrix (the average flow duration time is fixed to 150
time units). The simulation runs last 300,000 events each, where an event can be either
a flow arrival or completion.
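
To illustrate the traffic model on the smallest possible case, the sketch below simulates a single isolated pipe with Poisson arrivals and exponential durations, counting events as arrivals or completions. It is not the simulator used for the results reported here; the function name, the seed and the single-pipe restriction are assumptions made for the example.

```python
import random


def simulate_pipe(load, capacity, events=300_000, mean_duration=150.0, seed=1):
    """Event-driven simulation of one pipe; returns the fraction of blocked arrivals."""
    rng = random.Random(seed)
    arrival_rate = load / mean_duration        # flows per time unit
    in_progress = arrivals = blocked = 0
    for _ in range(events):
        departure_rate = in_progress / mean_duration
        # Memoryless model: the next event is an arrival with this probability.
        if rng.random() < arrival_rate / (arrival_rate + departure_rate):
            arrivals += 1
            if in_progress < capacity:
                in_progress += 1
            else:
                blocked += 1                   # no room on the pipe: flow blocked
        else:
            in_progress -= 1                   # a flow in progress completes
    return blocked / arrivals
```

For a load of 2.947 Erlangs on a pipe of 6 units, the estimate should fall close to the Erlang B value of about 0.049.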

The traffic matrix A in Table 1 shows the initial load (in Erlangs) for every
source-destination pair.

Table 1. Initial traffic matrix (in Erlangs). Rows are source nodes, columns are destination nodes.

        0      1      2      3      4      5
0       0      2.947  2.982  2.989  2.967  2.948
1       0      0      0      2.971  2.99   2.972
2       0      0      0      2.97   2.968  2.946
3       0      0      0      0      0      2.983
4       0      0      0      0      0      2.942
5       0      0      0      0      0      0




Given the matrix A, the high-level reallocation algorithm computes a bandwidth of 
approximately 6 Mbit/s for all pipes, given a blocking probability threshold Bth = 0.1. 

A very first comparison has been made regarding the pool. In order to highlight its
benefits, one entry of the traffic matrix - namely the one from source 0 to destination 1
- has been progressively increased (using a sine function) from its initial value of 2.947
Erlangs to a final value of about 115 Erlangs. Figure 4 shows the overall average blocking
probability over all SD pairs (which includes both direct and alternate paths). As expected,
it turns out that the presence of the pool dramatically improves the performance, regardless
of the alternate choice used. The system converges toward a point in which almost all
flows can be routed through the direct path. Figure 5 shows the behavior of the blocking
probability of pipe 0-1 when its load increases. The pool mechanism increases the pipe
size in order to keep the blocking probability below the enlarging threshold (i.e., 0.1).
Figure 6 shows the effect of the thresholds on the behavior of both pool assignments and
blocking probabilities, again for pipe 0-1. In this example, the network conditions have
been (purposely) set such that the blocking probability always exceeds the shrinking
threshold. Because of this, the pool enlarges the pipe but never shrinks it. The system
performs about 50 shrink/enlarge operations on the whole network. Figure 7 shows the
effect of raising the shrinking threshold to 0.095. In this example, again for pipe 0-1,
the capacity assigned by the pool evidently mirrors the offered load more closely. In this
case, the pool reacts much more frequently: the system performs about 1600 shrink/enlarge
operations on the whole network. Notice how the blocking probability sticks much more
closely to its target value.

Fig. 4. Blocking probabilities: pool vs. no pool (curves: without pool, with pool; vertical axis: blocking probability).

Fig. 5. Blocking probability vs. increasing load: the effect of the low-level (pool) mechanism (horizontal axis: time units).

Another effect of the pool is that, being local, it works on a “first come first served”
basis and, therefore, another pipe also needing part of the shared bandwidth might
remain unserved. This is shown in Figure 8, where a second entry in the traffic
matrix - namely the one from source 0 to destination 3 (also traversing link 0-1) - also starts
increasing, from its initial value of 2.989 Erlangs to 115 Erlangs, but later in time. Since
the first requiring pipe (i.e., 0-1) has already got much of the available bandwidth, the
second increasing one remains unserved by the pool mechanism.

Figure 9 shows the benefits of the joint action of the two reallocation mechanisms.
The plots refer to the blocking probabilities after the reallocation performed by the
high-level global mechanism, which redistributed the bandwidth evenly between the two pipes.

Fig. 6. The effect of the shrinking threshold (set to 0.01).

Fig. 7. The effect of the shrinking threshold (set to 0.095).



4 Conclusions 

A measurement-based real-time rearrangement policy for DiffServ pipes, along with a
new adaptive bandwidth reallocation scheme, has been introduced and investigated. A
“bandwidth pool” is proposed to react to a high blocking probability of new incoming
flows as well as to unused allocated bandwidth, while a high-level global reallocation
control mechanism is adopted for adjusting the distribution of shared resources on a
long-term basis. It has been shown by simulation that the use of the proposed mechanism
reduces the blocking rate.

Abstract. We propose a novel approach to Quality of Service, intended
for IP over SONET (or IP over WDM) networks, that offers end-users 
the choice between two service classes defined according to their level of 
transmission protection. The first service class (called Fully Protected 
(FP)) offers end-users a guarantee of survivability: all FP traffic is pro- 
tected in the case of a (single) failure. The second service class (called 
Best-Effort Protected (BEP)) does not offer any specific level of protec- 
tion but is cheaper. When failures occur, the network does the best it 
can by only restoring as much BEP traffic as possible. We motivate the 
need for two classes of protection services based on observations about 
backbone network practices that include overprovisioning and an ongo- 
ing but unbalanced process of link upgrades. We use an ILP formulation 
of the problem for finding primary and backup paths for these two classes 
of service. As a proof of concept, we evaluate the gain of providing two 
protection services rather than one in a simple topology. These initial 
results demonstrate that it is possible to increase the network load (and 
hence revenue) without affecting users that want complete survivability 
guarantees. 



1 Introduction 

Today’s internet backbone contains a large amount of unused capacity due pri- 
marily to the following three reasons: overprovisioning, duplication of equipment 
and unbalanced link upgrades. Overprovisioning is the current de facto solution 
to providing QoS. 

A lot of effort is devoted to broadening the set of Internet services to a palette 
ranging from best-effort to real-time and streaming services. The proposed solu- 
tions for such services differ in the mechanisms they use - such as reservation or 
priority. However, their common goal is to provide users with a variety of service 
classes that differ based on their performance with respect to throughput, loss 
and/or delay measures. Such a differentiation is indeed useful when congestion 
occurs in portions of the network. But backbone networks are usually over- 
provisioned because it is often simpler and cheaper to buy additional hardware 
equipment than to run complex software for managing reservations and priorities
in routers. Hence traffic rarely experiences congestion in the backbone, making
service differentiation of little practical use. Overprovisioning allows
carriers to provide everybody with the best class of service.

Not only is the backbone overprovisioned to offer low delay and losses to all
traffic, but most equipment is duplicated for protection against failures. Carriers
are not willing to forgo this additional redundancy because they do not want
network services to be disrupted, even rarely. (Failures are actually less rare
than one might expect; [2] has recently reported failure rates of 1 per year per
300 km of fiber.) Avoiding service disruption is especially critical for backbone
links, where a single failure may interrupt many channels. A large fraction of 
the capacity in the backbone links therefore remains unused, and this situation 
is likely to continue as long as the bottlenecks are in the access network rather 
than the backbone. 

On the other hand, because traffic demands grow exponentially, network op- 
erators are continuously obliged to upgrade the capacity of their backbone links. 
Upgrading a backbone link can be a lengthy operation, and thus in practice 
links are upgraded one at a time. Many months can pass between the upgrading 
of two links. Providing protection means that an upgrade of the capacity for
primary working links should be matched by an equivalent upgrade of the redundant
protection links. However, since the network is essentially in a continual
state of flux, the typical network is quite heterogeneous, containing some recently
upgraded high-speed links (e.g. links with a DWDM system of 80 to 160 wavelengths
operating at a 10 Gbps line speed), alongside older slower-speed links
(e.g. WDM fibers with only 4 to 32 wavelengths at 2.5 Gbps line speed). This situation
prevents operators from making use of the capacity in recently upgraded
links. To see why, consider the following scenario. Suppose all links are initially
2.5 Gbps and then exactly one of them is upgraded to 10 Gbps. The full capacity
of this link cannot be used for paths spanning multiple hops for two reasons.
First, other links may not be able to support the growth in traffic, and second,
it is unlikely that a backup path, on the other 2.5 Gbps links, can be found for
this additional traffic.

The combination of overprovisioning, redundant capacity for failures, and 
partial network upgrades creates a situation in which, on a day-to-day basis, 
there exists a large amount of unused bandwidth in the Internet backbone. In 
order to leverage this unused bandwidth we propose the use of two classes of 
service that differ according to the protection level provided. The two service 
classes are intended for either IP/SONET or IP/WDM networks with IP at the
logical layer and either SONET or WDM systems at the physical layer. The first
one, hereafter called the Fully Protected (FP) class, offers users the assurance
that none of their traffic will be disrupted in case of a single point of failure.
The second one, hereafter called the Best-Effort Protected (BEP) class, does 
not provide any specific guarantee on service disruption. Instead, in the case of 
failure, it offers to restore as much of the affected traffic as possible. What BEP 
offers to users, as a tradeoff for a lower amount of protection, is either a larger 
throughput, or a lower price. We will discuss how having two such services
allows carriers to carry the BEP traffic on the excess capacity without impacting
the FP traffic.

Many proposals for service classes differentiate the classes according to their 
delay, loss or throughput performance. Reliability is also an important QoS per- 
formance metric and the wide variety of applications that exist today demand 
different levels of availability guarantees. Some applications, such as IP tele- 
phony, video-conferencing, and distance surveillance require 100% availability 
and hence full protection against network failures. Others, like on-line games, 
Web surfing, and Napster downloads, are likely to accept partial and slower
protection in exchange for increased throughput (or a lower price). Such tradeoffs
are attractive as long as the probability of a service becoming unavailable is very 
small. Applications like e-mail can fall into either one of these service classes. 

The reliability dimension of QoS can be quite independent of the traditional 
QoS parameters of delay, loss and throughput that are often correlated to one 
another. For example, two applications requiring similarly high levels of relia- 
bility need not have similar delay requirements. Most applications requiring full 
protection will be the priority traffic, but this may not always be the case.
Table 1 demonstrates that categorizing applications by their protection needs can
be different from categorizing them according to their traditional QoS needs.
Reliability also differs from these traditional QoS measures in that delay, loss 
and throughput guarantees can be trivially satisfied by overprovisioning (if you 
are willing to pay for it), whereas reliability cannot because the amount of over- 
provisioning has to be carefully calculated. Overprovisioning to provide delay, 
loss and throughput guarantees can be done by simply inflating each link by say 
20 or 30%, or by ensuring that the load on each link rarely exceeds specified 
thresholds (e.g., 60%). However, such a per-link view of overprovisioning is
insufficient for meeting reliability guarantees, which require a network-wide view
of the capacity. This is because all links must be inflated proportionally if one
wants to ensure that backup paths will exist for all source-destination pairs of 
flows. The slow and unbalanced process of link upgrades makes it very difficult 
to overprovision using a network-wide perspective. 

The introduction of services offering different levels of protection guarantees 
at the WDM layer is gaining attention in the optical networking community. A 
classification in five classes is proposed in [3]. In our paper, we consider only two
classes, but defined at the IP layer. This will result in making some SONET (or
WDM) paths protected and others not, as in the work of Sridharan and Somani.

Despite some differences with this latter work in the problem formulation 
(for instance, we do not introduce different costs for each type of working or 
back-up paths, but we constrain the ratio between BEP and FP traffic to be less 
than a prescribed maximal value), we reach a similar conclusion: the traffic
load can be increased quite considerably when more than one class of protection
is available in a homogeneous network, where all links have the same capacity.
We show that this effect is even more accentuated in a heterogeneous network. 

The rest of this paper is organized as follows. Section 2 briefly summarizes
the kinds of mechanisms provided for protection at the optical and IP layers. We
state our service definitions in Section 3 and describe which protection mechanisms
are needed by each of the service classes. The ILP formulation of the
resulting routing problem is given in Section 4. For a proof-of-concept demonstration,
we provide an example in Section 5 that illustrates that a good deal of
BEP traffic can be carried on the network without affecting the FP traffic, and
thus we can substantially increase the load (and hence the revenue) the network
carries. In Section 6 we extend our ILP formulation in order to secure a minimal
amount of bandwidth to restore a fraction of the BEP traffic after a failure, so
that this class of traffic does not suffer a complete service disruption in case of
a failure, but a softer degradation. We conclude our proposal in Section 7.



Table 1. Service Categorization

Service                          Fully Protected         Best Effort Protected
-------------------------------  ----------------------  ---------------------
Low delay and losses             IP telephony,           cheap on-line games
                                 distance monitoring
Loose delay or loss requirement  professional e-mail     private e-mail,
                                                         web surfing

2 Handling Failures at the IP and SONET Layers 

Defining classes of service for protection requires specification of how protection 
is handled for each class. Before stating our proposal, we review the mechanisms 
that are available at each layer in the network. The optical layer provides protection
that carries out very fast failure recovery but is often not bandwidth efficient.
The IP layer can provide restoration that helps to determine more efficient
routes but is typically not very fast. Most networks today rely on SONET to 
carry out protection. 

Protection at the SONET layer. All protection techniques involve pro- 
viding some redundant capacity within the network to reroute traffic in case 
of a failure. Protection is the mechanism by which traffic is switched to avail- 
able resources when a failure occurs. It needs to be very fast; the commonly 
accepted standard for SONET is 50 ms. Protection routes must therefore be 
pre-computed, and wavelengths must be reserved in advance at the time of con- 
nection setup. Protection around the failed facility can be done at different points 
in the network: (i) around the two end-points of the failed link, by line or
span protection (in optical layer terminology this corresponds to protection at
the line or multiplex sublayer), or (ii) by path protection which is between the
source and destination of each connection traversing the failed link (in optical
layer terminology, this corresponds to protection at the path sublayer).
Line protection is simpler, but path protection requires less bandwidth and can 
better handle node failures. Here, we only consider path protection.

There are essentially two fundamental protection mechanisms. In 1+1 protection,
traffic is transmitted simultaneously on two separate fibers on disjoint
routes. The receiver accepts traffic from the primary fiber (also called working
fiber) and only switches to accept the input from the other fiber (called protection
or back-up fiber) in case the primary one fails. In 1:1 protection, traffic
is transmitted only on the primary fiber. If this fiber is cut, the sender and
receiver use simple signalling to jointly switch to the backup fiber. The generalization
of 1:1 protection is 1:n protection, where one back-up path protects n
working paths. For our initial proof-of-concept analysis, we consider 1+1 and
1:1 protection schemes in this paper.

Restoration at IP layer. Since the IP layer is made up of a well meshed 
topology and its links are not fully loaded (due to overprovisioning), the IP layer 
is also capable of restoring traffic around a failed facility. 

After SONET protection is done, today’s routing protocols can discover 
routes in the new topology that are more efficient than the backup path used for 
failure recovery in the old topology. Within a carrier’s backbone, Interior Gateway
Protocols (IGP) are used for intradomain routing. IS-IS and OSPF are the
most common protocols deployed today. In these protocols, routers periodically 
exchange hello messages to check the health of neighboring links and nodes. If 
a few successive messages are lost, a router deduces that a link or node is down, 
and begins the restoration process at the IP layer. After detection of a topology 
change, this process involves propagating the change information across the net- 
work and recomputing shortest paths. During the restoration process, a subset 
of destinations are reached through non-optimal routes (if the network supports 
SONET) or are briefly unreachable (otherwise). In IS-IS, the process of failure
detection can take between 10 and 30 seconds depending upon the protocol configuration,
and the rest of the recovery process can take another 10 seconds or so.
Although IS-IS convergence today takes on the order of tens of seconds, it
is believed that these convergence times can be greatly reduced, potentially
to the order of tens of milliseconds. The theoretical limit of link-state routing
protocols to reroute is in link propagation time scales, in other words in the tens
of milliseconds. Using today’s technologies, restoration speed at the IP layer can- 
not compete with the protection and restoration speeds at SONET (or WDM) 
layers. 
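
As a rough illustration of the time scales involved, the sketch below estimates the worst-case IP-layer outage as the failure detection time (hello interval times the number of consecutive hello messages that must be lost) plus the remaining convergence time. The function names and the parameter values (a 10 s hello interval, 3 lost hellos, 10 s of convergence) are assumptions chosen to fall within the ranges quoted above; they are not measurements.

```python
SONET_PROTECTION_S = 0.050  # commonly accepted SONET protection target (50 ms)


def detection_time_s(hello_interval_s: float, missed_hellos: int) -> float:
    """Worst-case time to declare a neighbour down from lost hello messages."""
    return hello_interval_s * missed_hellos


def ip_outage_s(hello_interval_s: float, missed_hellos: int,
                convergence_s: float) -> float:
    """Detection time plus the rest of the recovery (flooding, SPF, updates)."""
    return detection_time_s(hello_interval_s, missed_hellos) + convergence_s


if __name__ == "__main__":
    # Assumed values: 10 s hellos, 3 lost hellos, 10 s for the rest of recovery.
    outage = ip_outage_s(10.0, 3, 10.0)
    print(f"IP-layer outage: {outage:.0f} s")
    print(f"Ratio to SONET protection: {outage / SONET_PROTECTION_S:.0f}x")
```

Under these assumed values the IP layer needs tens of seconds where SONET needs tens of milliseconds, which is the gap the two service classes are built around.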

A difficulty that arises in today’s networks, e.g., IP/SONET, is that each 
layer performs protection independently from the other layers. For example, IGP 
routing table updates are performed independently of SONET’s line protection. 
This can lead to undesirable race conditions between different layers. Ideally 
IP and optical networks should be managed as an integrated network without 
overlap of functionality between layers and with sharing of information between 
layers. Deciding exactly which aspects of protection and restoration
should be carried out by which layer is still an open issue. The advantage of
providing protection at the IP layer is the cost reduction that results from saving 
redundant equipment at the physical layer. The disadvantage is that it is slow. 
Providing protection at the SONET layer has the reverse tradeoff.

3 Definition and Provisioning of Service Classes 

We now define our two service classes that differ in terms of their level of pro- 
tection, their mechanism of protection and their cost. 

Fully Protected (FP) service class. 

— This service guarantees its customers that their traffic is protected against 
any single point of failure in the backbone within 50 msec. 

— This service provides fast protection. Therefore FP traffic is protected via 
pre-computed, dedicated back-up paths at the SONET or WDM layer, using 
either 1:1 or 1+1 protection. Failures are transparent to the IP layer for 
this class of traffic. 

— This service is the more expensive of the two. 

Best Effort Protected (BEP) service class. 

— This service does not offer specific guarantees for protection against failures, 
but instead tries to restore as much of this traffic as possible after the
occurrence of a failure.

— For BEP traffic we offer restoration and not protection. When a failure oc- 
curs, BEP packets will be dropped at the router before the point of con- 
gestion, until IP has been able to restore this traffic by rerouting it on an 
alternate IP path. Actually this service can come in a variety of flavors. The 
simplest version of this service class is to leave BEP traffic entirely unprotected
at the SONET (and/or WDM) layer. An enhanced version of this
service (more difficult to implement) is to assure users that in case of a
single failure, they would not experience a complete service disruption but
may experience a severe degradation.

— This service is cheaper than the FP service. 

In order to implement two such service classes, packets would need to be 
marked according to their service class, and IP routers would need class-based 
scheduling. In normal operation, differentiation is not needed between the two 
types of packets. However, upon notification of a failure, FP packets continue to 
be served as before, while BEP packets are dropped until BEP traffic has been 
restored at the IP layer. 
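
The forwarding rule described above can be sketched as follows. This is a minimal illustration, not a router implementation: the names `ServiceClass`, `RouterState` and `should_forward` are hypothetical, and the two boolean flags stand in for whatever failure-notification and restoration signals a real router would receive.

```python
from dataclasses import dataclass
from enum import Enum


class ServiceClass(Enum):
    FP = "fully-protected"
    BEP = "best-effort-protected"


@dataclass
class RouterState:
    failure_notified: bool = False  # a failure notification has been received
    bep_restored: bool = True       # IP-layer rerouting of BEP traffic is complete


def should_forward(service_class: ServiceClass, state: RouterState) -> bool:
    """Decide whether a packet of the given class is served or dropped."""
    if service_class is ServiceClass.FP:
        # FP traffic is protected below IP, so failures are transparent to it.
        return True
    # BEP traffic is dropped between failure notification and IP restoration.
    return (not state.failure_notified) or state.bep_restored


if __name__ == "__main__":
    during_failure = RouterState(failure_notified=True, bep_restored=False)
    print(should_forward(ServiceClass.FP, during_failure))
    print(should_forward(ServiceClass.BEP, during_failure))
```

In normal operation both classes are forwarded alike; the marking only matters in the interval between the failure notification and the completion of IP rerouting.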

4 ILP Formulation 

We formulate the problem of routing traffic flows from two service classes over
physical and logical topologies as an Integer Linear Programming (ILP) problem
whose objective is to maximize the total load carried by the network, which we
denote by F.

We consider here that all physical channels are SONET paths. They could 
also be WDM lightpaths, if all optical cross-connects have full wavelength
conversion possibilities, or if they perform electronic conversion before switching, so
that wavelength continuity constraints can be ignored. Each path is assigned
a unit capacity, which represents the smallest granularity level of bandwidth of
a SONET path. The total capacity $C_l$ of physical link $l$, with $1 \le l \le L$ where
$L$ is the number of physical links, is thus an integer.

The logical topology is the set of logical links between IP routers. Let $M$ be the
number of routers that are connected by a logical link. Each logical link between
router $s$ and router $t$ has capacity $d_{st}$, and is a set of consecutive physical links
that form a route $r$. Logical links are considered here as bi-directional, i.e. $d_{st}$
is the sum of the demand from $s$ to $t$ and from $t$ to $s$. In the following, we thus need
to consider only source-destination pairs $(s,t)$ with $s < t$. This assumption
can clearly be relaxed.

Each logical link presents a demand of $d_{st}$ capacity units at the physical
layer, for which one needs to find a route $r$ among the set of all routes $\mathcal{R}_{st}$
between the source $s$ and the destination $t$, such that the capacity constraints of
all links $l$ belonging to route $r$ are satisfied. To keep routing at the physical layer
simple, we do not allow multiple working paths between a given pair of nodes.
Denoting by $d^r_{st}$ the traffic flowing on route $r \in \mathcal{R}_{st}$, we thus have, for all
$1 \le s < t \le M$,

$$d_{st} = \max_{r \in \mathcal{R}_{st}} \{ d^r_{st} \}. \qquad (1)$$

If multiple routes were allowed, one would have to replace the maximum in this
equation by a sum.

A logical link between a given pair of nodes $(s,t)$ carries $d^{FP}_{st}$ traffic units of
the FP class, and $d^{BEP}_{st}$ traffic units of the BEP class:

$$d_{st} = d^{FP}_{st} + d^{BEP}_{st}. \qquad (2)$$

Because of (1), both traffic classes are carried on the same route $r \in \mathcal{R}_{st}$.

In the simplest case, no precaution is taken to guarantee even a partial 
restoration of BEP traffic at the logical layer. This means that BEP traffic can 
be left unprotected, and will be restored only if resources are available after the 
failure has occurred. In the worst case, all BEP traffic may have to be dropped 
as a result of a failure. 

On the other hand, FP traffic is protected on a 1+1 or 1:1 basis. This is the
simplest and fastest recovery scheme, but also the most resource-consuming. It
requires that for each primary route $r \in \mathcal{R}_{st}$, we find a link-disjoint route $r'$ (if
we have only link failures) or even a link- and node-disjoint route (for the general
case where both link and node failures can occur) from $r$, that can carry $d^{FP}_{st}$
traffic units. We consider here only the case of link failure. To state the resulting
constraint, we first introduce the membership function

$$e^r_l = \begin{cases} 1 & \text{if } l \in r \\ 0 & \text{if } l \notin r \end{cases}$$



for any link $1 \le l \le L$ and any route $r \in \mathcal{R}_{st}$. We must therefore find a route
$r' \in \mathcal{R}_{st}$ such that the traffic demand $\tilde{d}^{r'}_{st}$ on this protection route verifies, for
all $1 \le s < t \le M$ and $r \in \mathcal{R}_{st}$,

$$\tilde{d}^{r'}_{st} = d^{FP}_{st} \qquad (3)$$

$$\sum_{1 \le l \le L} e^r_l \, e^{r'}_l = 0. \qquad (4)$$

Constraint (3) states that all traffic of the FP class, flowing on the primary route
$r$, must be protected by a traffic allocation of the same amount on a back-up
route $r'$. Constraint (4) ensures that it is link disjoint with $r$.

The finite link capacity imposes that for all $1 \le l \le L$

$$\sum_{s=1}^{M} \sum_{t=s+1}^{M} \sum_{r \in \mathcal{R}_{st}} e^r_l \left( d^r_{st} + \tilde{d}^r_{st} \right) \le C_l. \qquad (5)$$
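
To make the role of the membership function concrete, the following sketch represents a route as a set of physical links and checks link-disjointness by summing the product of the two membership indicators over all links, which is zero exactly when the routes share no link. The link set and routes are taken from the example of Section 5; the function names are illustrative.

```python
# Physical links of the example network, each written as (u, v) with u < v.
LINKS = [(1, 2), (2, 3), (3, 4), (1, 4), (1, 5), (5, 6), (2, 6), (3, 5)]


def membership(link, route) -> int:
    """Indicator: 1 if the physical link belongs to the route, 0 otherwise."""
    return 1 if link in route else 0


def link_disjoint(route_a, route_b, links=LINKS) -> bool:
    """True when the sum over links of the indicator products is zero."""
    return sum(membership(l, route_a) * membership(l, route_b) for l in links) == 0


if __name__ == "__main__":
    working_ab = {(1, 2)}
    backup_ab = {(1, 5), (5, 6), (2, 6)}
    backup_bc = {(2, 6), (5, 6), (3, 5)}
    print(link_disjoint(working_ab, backup_ab))  # working and back-up route of (A, B)
    print(link_disjoint(backup_ab, backup_bc))   # two back-up routes sharing links
```

The first pair is disjoint, as required of a working route and its back-up; the second pair shares links (5,6) and (2,6), so the capacities of the two back-up routes add up on those links.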

The last two constraints are provided by the actual traffic data.

The first one is the proportion of traffic belonging to each class. We represent
the amount of BEP traffic between node pairs as a given multiple $\rho$ of the
FP traffic. Clearly, FP traffic will require more resources than BEP traffic, so we
need to set a maximum value to this ratio, since otherwise the optimal solution
will always consist in having all traffic in the BEP class. Therefore we constrain
$\rho$ to be no larger than a given maximal value $\rho_{\max}$:

$$d^{BEP}_{st} \le \rho_{\max} \, d^{FP}_{st} \qquad (6)$$

for all $1 \le s < t \le M$.

The second one is the fraction of the total load $F$ that needs to be assigned
between each pair of nodes, and which would be obtained from the IP traffic matrix
data. By default, we assume here a balanced distribution of the load between each
pair of IP nodes, so that the same fraction of the total load is assigned between
each pair of nodes:

$$d_{st} = d_{s't'} \qquad (7)$$

for all $1 \le s < t \le M$, $1 \le s' < t' \le M$.

The problem therefore amounts to maximizing

$$F = \sum_{s=1}^{M} \sum_{t=s+1}^{M} d_{st}$$

subject to constraints (1) to (7).

5 Example 

In this section, we illustrate our ideas with a numerical example. The goal of 
this example is to serve as a proof of concept to demonstrate the gain that can 
be achieved by supporting more than one protection service class. In today’s 
networks the only protection class is FP.

[Figure 1, left panel: Physical topology (WDM/SONET layer); right panel: Logical topology (IP layer)]

Fig. 1. A SONET/WDM network (left) with working paths in plain and back-up paths
in dashed lines. The logical topology at IP layer is represented on the right, and consists
here of three logical links.



Figure 1 shows a network, consisting of $N = 6$ nodes and $L = 8$ links at
the physical layer (SONET/WDM), and of $M = 3$ nodes and $M(M-1)/2 = 3$
links at the logical layer (IP). We consider here that all physical channels are
SONET paths. Remember that the capacity unit is the smallest capacity of a
SONET path, and that the capacity $C_l$ of a link $l$ is therefore an integer multiple
of this capacity unit. In our example, the capacity of each physical link is equal
to 8 if the link has not been upgraded, and to 32 if the link has been upgraded.
Figure 1 shows one possible mapping of the logical links (right) on the physical
links (left), which is as follows:

— logical (IP) link (A, B) is mapped on working (SONET) physical path (or 
route) {(1,2)} and back-up (SONET) physical path {(1, 5), (5, 6), (2, 6)}; 

— logical link (A,C) is mapped on working physical route {(1,4), (3,4)} and 
back-up physical route {(1, 5), (3, 5)}; 

— logical (IP) link (B,C) is mapped on working physical route {(2,3)} and 
back-up physical route {(2, 6), (5, 6), (3, 5)}. 

Other mappings are of course possible; the mapping that is eventually
adopted is the one that solves the ILP described in the previous section.
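
For a fixed mapping, the maximal load can be found by direct enumeration, which gives a feel for how the capacity and ratio constraints interact. The sketch below does this for the mapping listed above, assuming the same integer demand on every logical link, dedicated back-up capacity for FP traffic, and no protection for BEP traffic. It does not search over alternative routes as the ILP does, so under the same modelling assumptions its result is at best a lower bound on the ILP optimum; all identifiers are illustrative.

```python
LINKS = [(1, 2), (2, 3), (3, 4), (1, 4), (1, 5), (5, 6), (2, 6), (3, 5)]

# Working and back-up routes of the three logical links, as listed in the text.
WORKING = {"AB": {(1, 2)}, "AC": {(1, 4), (3, 4)}, "BC": {(2, 3)}}
BACKUP = {
    "AB": {(1, 5), (5, 6), (2, 6)},
    "AC": {(1, 5), (3, 5)},
    "BC": {(2, 6), (5, 6), (3, 5)},
}


def link_load(link, d_fp, d_bep):
    """Capacity units used on one physical link by working and back-up traffic."""
    load = 0
    for pair in WORKING:
        if link in WORKING[pair]:
            load += d_fp + d_bep  # working route carries both classes
        if link in BACKUP[pair]:
            load += d_fp          # back-up route reserves FP traffic only
    return load


def max_total_load(capacity, rho_max):
    """Return (total load, FP demand, BEP demand) per logical link at the optimum."""
    best = (0, 0, 0)
    limit = max(capacity.values())
    for d_fp in range(limit + 1):
        for d_bep in range(limit + 1):
            if d_bep > rho_max * d_fp:
                continue  # BEP/FP ratio constraint violated
            if any(link_load(l, d_fp, d_bep) > capacity[l] for l in LINKS):
                continue  # some link capacity exceeded
            total = len(WORKING) * (d_fp + d_bep)
            if total > best[0]:
                best = (total, d_fp, d_bep)
    return best


if __name__ == "__main__":
    not_upgraded = {l: 8 for l in LINKS}
    for rho_max in (0, 1, 4):
        print(rho_max, max_total_load(not_upgraded, rho_max))
```

With all eight links at capacity 8, this mapping yields a total load of 12 when no BEP traffic is allowed and 24 when the BEP/FP ratio may reach 1. Raising only the four working links (1,2), (2,3), (3,4) and (1,4) to capacity 32 gives 96 at a ratio of 7, the same value this mapping reaches when all eight links are upgraded.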

We use the ILOG optimizer to find the solution of the ILP. Figure 2 displays
the results, when the following number of links have been upgraded: (i) none,
(ii) two links ((1,2) and (2,3)), (iii) four links ((1,2), (2,3), (3,4) and (1,4))
and (iv) all eight links. The x-axis denotes $\rho_{\max}$, which is defined by (6).

First observe the case when all links have the same capacity, either before
an upgrade or after an upgrade of all links. If we compare the scenario without
any BEP traffic ($\rho_{\max} = 0$), and a scenario with BEP traffic ($\rho_{\max} = 1$), we
see that we can nearly double the load on the network. As $\rho_{\max}$ denotes the
maximal ratio between BEP and FP, it is natural that after some value of $\rho_{\max}$,
the curves become flat because no more additional traffic can be added in the
system.

Second, consider the case of a partial upgrade and say $\rho_{\max} = 4$ for example.
If only two links are upgraded no additional FP traffic can be carried on the
network. However a good deal of BEP traffic can be added after the upgrades.
In case of an upgrade of four (appropriately chosen) links, one even reaches the
same capacity as with a full upgrade of all eight links, for $\rho_{\max} > 7$.

Fig. 2. Total number of demands (total load) versus maximal ratio $\rho_{\max}$ of BEP traffic
over FP traffic.



In the previous example, no precaution was taken to prevent BEP traffic from 
being dropped in case of a failure. It is however desirable that the connectivity 
of the IP layer be preserved after a single failure, so that BEP traffic is partly 
restorable (by partly restorable, we mean here that every IP node is reachable, 
but that queuing delays may become significant). 

This imposes an additional constraint on the mapping of the logical topology
on the physical topology, namely that a single failure leaves the logical topology
connected. This problem has been shown to be NP-complete and therefore
requires heuristics for general logical and physical topologies. However, when the
logical topology is a ring (as in our example), this constraint becomes particularly
simple to state: one must simply check that no physical link is shared by two
logical links, since otherwise the failure of such a physical link would leave the
logical topology disconnected. In other words, we now introduce the additional
constraint that for all $1 \le s < t \le M$, $1 \le s' < t' \le M$, with $(s,t) \ne (s',t')$, and
for any $r \in \mathcal{R}_{st}$ and $r' \in \mathcal{R}_{s't'}$ carrying traffic, the routes $r$ and $r'$ have no
physical link in common.

In this case, the curve in Figure 2 for 2 upgraded links coincides with the
curve for zero upgraded links. However, the curve for 4 upgraded links remains
unchanged.
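
For a ring logical topology, the connectivity condition reduces to a pairwise disjointness test on the working routes, which can be sketched as below. The first mapping is the one of Figure 1; the second, which routes logical link (A, C) over links (1,2) and (2,3), is a hypothetical counter-example added for illustration.

```python
from itertools import combinations


def keeps_ring_connected(working_routes) -> bool:
    """True if no physical link is shared by the working routes of two logical links."""
    return all(
        route_a.isdisjoint(route_b)
        for route_a, route_b in combinations(working_routes.values(), 2)
    )


FIGURE_1_MAPPING = {"AB": {(1, 2)}, "AC": {(1, 4), (3, 4)}, "BC": {(2, 3)}}

# Hypothetical alternative: (A, C) reuses the links of (A, B) and (B, C).
SHARED_LINK_MAPPING = {"AB": {(1, 2)}, "AC": {(1, 2), (2, 3)}, "BC": {(2, 3)}}

if __name__ == "__main__":
    print(keeps_ring_connected(FIGURE_1_MAPPING))
    print(keeps_ring_connected(SHARED_LINK_MAPPING))
```

In the second mapping a cut of link (1,2) takes down two of the three logical links at once, which isolates one router at the IP layer; the test rejects it for that reason.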

In this network, after a single failure on any link of the network, all FP traffic 
can therefore be rerouted on alternate routes offering the same capacity, whereas 
BEP traffic that used the broken link now needs to share routes with other BEP 
flows. As a result, congestion will occur for BEP traffic. In the example above, 
it is easy to check that the capacity offered to all BEP traffic after a failure is 
half the capacity it had before. 

A better service would be provided for the BEP class if we slightly over-provision
the links taken by BEP traffic. The spare capacity can absorb occasional
bursts of traffic when no failure has occurred, and offers rerouted BEP traffic
a less severe degradation after the occurrence of a failure.

Let us denote by $\epsilon$ the amount of over-provisioning we provide to the BEP traffic
class. This means that for every demand of $d^{BEP}_{st}$ traffic units between $s$
and $t$, we will actually reserve $(1+\epsilon)\, d^{BEP}_{st}$ capacity units. Because of the logical
ring topology of our example, one can check that a single failure will then always
leave a fraction $(1+\epsilon)/2$ of the capacity needed for restoring BEP traffic at the
IP layer. The value $\epsilon = 0$ corresponds to the previous case, where BEP traffic
receives half the capacity it had before a failure. A value $\epsilon = 1$ corresponds to a
fully restorable BEP traffic at the IP layer (in which case the only difference
between the FP and BEP traffic is the layer, and thus the speed, at which traffic
is restored).
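
The relation between the over-provisioning amount and the restorable share of BEP traffic, for the logical ring of the example, amounts to the following one-line computation. The function name is illustrative, and the range check simply reflects the two extreme cases discussed above (no over-provisioning and full restorability).

```python
def restorable_bep_fraction(epsilon: float) -> float:
    """Fraction of the needed BEP capacity left after a single failure on the ring."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon is expected to lie between 0 and 1")
    return (1.0 + epsilon) / 2.0


if __name__ == "__main__":
    for eps in (0.0, 0.5, 1.0):
        print(eps, restorable_bep_fraction(eps))
```

An over-provisioning of 0.5 gives a restorable fraction of 0.75.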

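This amounts to replacing constraint (5) by

$$\sum_{s=1}^{M} \sum_{t=s+1}^{M} \sum_{r \in \mathcal{R}_{st}} e^r_l \left( d^{r,FP}_{st} + (1+\epsilon)\, d^{r,BEP}_{st} + \tilde{d}^r_{st} \right) \le C_l$$

where $d^{r,FP}_{st}$ and $d^{r,BEP}_{st}$ are the FP and BEP traffic carried on route $r$,
$\tilde{d}^r_{st}$ is the back-up allocation on route $r$, and
$d^{r,FP}_{st} + d^{r,BEP}_{st} = d^r_{st}$ because of (1) and (2). Of course, we need to keep
the connectivity constraint introduced above in the set of constraints.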

Figure 3 shows the resulting total load when $\epsilon = 0.5$. Because of the over-provisioning,
the total load has decreased, compared to the scenario depicted in
Figure 2. In this new scenario, upgrading 4 links no longer allows the network
to reach the same total load level as in the case of upgrading all 8 links. This
is to be expected as it illustrates the tradeoff between carrying extra load and
providing (partial) restoration. However there is still a sizable gain in having
two protection services. For example, in this case of partial restoration for BEP,
with 4 links upgraded our approach can double the amount of new load carried
as compared to a system with a single service ($\rho = 0$).

Fig. 3. Total number of demands (total load) versus maximal ratio $\rho_{\max}$ of BEP traffic
over FP traffic, when $\epsilon = 0.5$, so that a fraction of 0.75 of the BEP traffic can be
restored.

7 Conclusion

We proposed two service classes based on the level of reliability required by users.
The FP class ensures fast protection at the SONET or WDM layer, and makes
failures transparent to the IP layer. The BEP class does not offer any availability
guarantees after a failure, and is left unprotected at the SONET and/or
WDM layers. This proposal allows carriers to make good use of a few upgraded 
backbone links that otherwise would provide limited benefit until the majority 
of the backbone links have been similarly upgraded. Preliminary results show 
that in heterogeneous networks resulting from partial upgrades, our approach 
allows a substantial amount of additional traffic to be carried. In particular, we 
showed that in the case of our simple topology, when half of the network links 
are upgraded the amount of new load carried can be doubled or tripled (de- 
pending upon the amount of protection offered to BEP users) as compared to 
an environment that supported only a single full protection service. Our results 
demonstrate that by having a second protection class of service, carriers achieve 
a new method of generating revenue without harming their existing protection 
class of service. 

Further research should investigate these benefits for larger physical topolo- 
gies, and meshed logical topologies. This approach should also be refined more 
generally (not just for ring topologies) so that the BEP traffic can secure some 
level of restoration at the IP layer. 

Finally, if SONET is no longer the layer handling failures, and if optical cross-connects
do not perform wavelength conversion, then MPLS may be needed to
map IP traffic directly on the lightpaths. The MPLS protocol indeed
offers a potential alternate mechanism for providing protection and restoration
at layers 2/3. MPLS is a general purpose tunneling mechanism that uses a simple
label-swapping forwarding technique to transport IP packets across an IP
network. It creates tunnels, called Label Switched Paths (LSPs) by distribut- 
ing labels along a path of MPLS-capable routers. The LSP tunnel essentially 
sets up a path through a network of connectionless IP routers. MPLS is suited 
for survivability for a few reasons. LSPs can be used as backup paths and can 
be computed in advance. This requires storing extra labels in a forwarding ta- 
ble. Also, MPLS is not dependent upon IGP convergence since backup LSPs 
can be established a priori. Research in the performance of MPLS restoration 
mechanisms is still immature. However it is hypothesized that for link failures, 
link protection can occur within tens of milliseconds since no signalling is re- 
quired. Yet path protection is expected to take on the order of seconds because 
this would require some signalling to inform the head of the tunnel about the 
topology change.

Abstract. The original design of the Internet and its underlying protocols did 
not anticipate users to be mobile. With the growing interest in supporting 
mobile users and mobile computing, a great deal of work is taking place to 
solve this problem. For a solution to be practical, it has to integrate easily with 
existing Internet infrastructure and protocols, and offer an adequate migration 
path toward what might represent the ultimate solution. In that respect, the 
solution has to be incrementally scalable to handle a large number of mobile 
users and wide geographical scopes, and well performing so as to support all 
application requirements including voice and video communications and a wide 
range of mobility speeds. In this paper, we present a survey of the state-of-the- 
art and propose a multi-layer architecture for mobility in IP networks. In 
particular, we propose the use of extended local area networks and protocols for 
efficient and scalable mobility support in the Internet. 



1 Introduction 

In the broadest sense, the term mobile networking refers to a system that allows users 
to maintain network connectivity while moving from one location to another. 
Mobility is often associated with wireless technologies that require mobile networks 
to support continuous movement, at high speeds and for long periods of time. 
Recently, there has been an explosive growth in wireless devices with built-in access
to the Internet. In the near future, large numbers of mobile users will access the 
Internet for a variety of high-speed multimedia services. IP packet switching has 
become the standard towards which many networks are converging, including those 
in the telecommunication sector, as less efficient, less enabling circuit-switched 
technologies are abandoned. Although a lot of progress has been made, supporting 
mobility in IP networks is still a difficult challenge. 



1.1 Challenges and Early Solutions 

1.1.1 Duality of IP Addresses 

The IP addressing scheme was designed and optimized for a stationary environment, 
which makes mobility difficult. With the introduction of mobile networking, IP 
addresses have acquired a dual significance. On one hand, they are expected to remain 
fixed during the course of a connection. An important reason for this is that, while, in 
principle, higher layers (above IP in the protocol stack) are supposed to be
independent of the IP layer, in practice they make use of the IP address for basic
functionality. For example, the transport layer uses this address to establish and 
maintain connections. If the IP address is changed during the course of a session, the 
connection is lost and the session is terminated. Therefore, to maintain seamless 
connectivity during movement, IP addresses need to be kept fixed, transparent to 
changes in user locations. On the other hand, IP addresses need to change 
dynamically as users move, since they are used for packet routing and delivery. 
Routers in the Internet use IP addresses in the destination field of packets to identify 
the subnet where the user is located, and to obtain the MAC address of the user for 
final delivery of the packets. Moreover, routers in the Internet typically use
address-based filtering to discard packets whose source IP addresses are from outside
the subnet. Therefore, the user's IP address needs to change as the user changes
location in order to conform to addressing at the new location.¹
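
The conflict between the two roles of an IP address can be illustrated with a minimal sketch. The subnet prefixes, the host address and the function name passes_source_filter below are hypothetical and serve only as an illustration of address-based filtering; they do not come from any of the systems surveyed here.

```python
import ipaddress

# Hypothetical prefixes, chosen only for illustration.
HOME_SUBNET = ipaddress.ip_network("192.0.2.0/24")
VISITED_SUBNET = ipaddress.ip_network("198.51.100.0/24")


def passes_source_filter(source, local_subnet):
    """Address-based filtering: accept only sources that belong to the local subnet."""
    return ipaddress.ip_address(source) in local_subnet


# The fixed address by which the transport layer identifies the host.
host_address = "192.0.2.17"

at_home = passes_source_filter(host_address, HOME_SUBNET)
after_move = passes_source_filter(host_address, VISITED_SUBNET)
print(at_home, after_move)
```

Keeping the address fixed preserves the transport connection, but the same address then fails the filter in the visited subnet, which is the duality discussed above.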

Notice that mobile networking inside a subnet is not affected by the dual
significance of IP addresses. Mobile users can roam inside a subnet without having to
update their IP addresses. This is possible because LAN switches learn the user's
location and can forward packets to the user quickly using this information.

One way to resolve the duality of IP addressing is to change the transport and 
application layers of the protocol stack in order to handle a dynamic IP address. A 
mobility solution at the TCP layer is proposed in [3]. Connection migration is 
performed to maintain connectivity for sessions in-flight at the time of move. For this 
solution to work, the mobile hosts, and fixed hosts in the Internet wishing to 
communicate with mobile hosts would need to be upgraded to the new versions of 
software. While upgrading the mobile hosts may be an easier task, upgrading all the 
hosts in the Internet is not a possibility. Furthermore, achieving good application 
performance with dynamic IP addresses remains a significant challenge. Simulation
results show that significant disruption is incurred during migration; moreover, the
solution limits movement to a single end and may apply to TCP applications
only. Another transport layer solution, proposed in [18], suggests that TCP be modified to
use domain names instead of IP addresses. Again, the main disadvantage of this
solution is that it does not integrate easily with the existing Internet; hence it could be
prohibitively expensive to deploy.

The alternative solution to the problem of IP address duality is to allow hosts to 
maintain a fixed IP address as they move across subnets. In turn, this would require 
that routers propagate host-specific routes in the Internet. However, host-specific 
routing requires space in the routing tables proportional to the number of hosts, slows 
down the routing process and consumes potentially excessive bandwidth in the 
Internet. 
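
To make the table-size argument concrete, the following sketch compares the number of entries needed under prefix routing and under host-specific routing for one subnet. The prefix, the interface name "if0" and the table layout are assumptions made for illustration only.

```python
import ipaddress

# Hypothetical subnet; the prefix and its size are assumptions for illustration.
subnet = ipaddress.ip_network("192.0.2.0/24")

# Prefix routing: a single entry covers every host in the subnet.
prefix_table = {subnet: "if0"}

# Host-specific routing: one /32 entry for each host that may have moved elsewhere.
host_table = {ipaddress.ip_network(f"{h}/32"): "if0" for h in subnet.hosts()}

print(len(prefix_table), len(host_table))
```

For this small subnet the host-specific table already needs 254 entries where prefix routing needs one, and every move of a host would in addition have to be propagated to the routers holding its entry.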

1.1.2 Mobile IP 

In the 1990s, the IETF designed a solution for mobility known as Mobile IP [4],
which overcomes the duality of IP addresses without requiring that routers learn
host-specific routes. Mobile IP solves the problem by allowing a single computer to hold
two addresses simultaneously. The first address is permanent and fixed. It is the
address that transport and application protocols use. The second address is temporary:
it changes as the computer moves, and is valid only while the computer visits a
given location.

¹ Note that, to resolve this duality, in IPv6 a host is allowed to use two addresses. One address
is used as a permanent identifier while the other address is used for routing purposes. The
permanent address is included in the main IPv6 header, while the routing address is inserted
in a special-purpose extension header used for routing.

A mobile host MH is assigned a permanent home address and a home agent HA in 
its home subnet. DNS maps the domain name of the host to its home address. When 
the MH moves to a foreign subnet, it acquires a temporary care-of-address COA from 
an advertised foreign agent in the subnet, and it registers its new address with the HA. 
The HA uses gratuitous proxy ARP to capture all IP packets addressed to the MH's
permanent address and uses encapsulation to forward them to the mobile's current
location.²

There are two possibilities for the packets going back from the MH to the 
corresponding host CH. One choice is for the MH to send out un-encapsulated IP 
packets with the permanent home address of the MH as the source address and the 
address of the CH as the destination address. However, some routers in the Internet 
use address based filtering and discard packets from outside the subnet. To avoid this, 
the MH needs to encapsulate the packet using its COA as the source address and the 
address of the HA as the destination address. The HA decapsulates the packet and 
forwards it to the CH. 
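
The forward and reverse tunnels described above can be sketched as follows. This is a simplified illustration rather than the Mobile IP specification: the class and function names are hypothetical, all addresses are made up, and the tunnel is assumed to end at the care-of address without modelling the foreign agent separately.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    src: str
    dst: str
    payload: object  # application data, or an inner Packet when encapsulated


class HomeAgent:
    def __init__(self, address):
        self.address = address
        self.bindings = {}  # home address -> current care-of address

    def register(self, home_address, care_of_address):
        self.bindings[home_address] = care_of_address

    def forward(self, packet):
        """Tunnel a packet for a registered mobile host toward its care-of address."""
        coa = self.bindings.get(packet.dst)
        if coa is None:
            return packet  # no binding: the host is at home, deliver normally
        return Packet(src=self.address, dst=coa, payload=packet)

    def decapsulate(self, tunneled):
        """Reverse tunnel: strip the outer header and recover the inner packet."""
        return tunneled.payload


def reverse_tunnel(inner, care_of_address, home_agent_address):
    """MH side: wrap the packet so that its outer source address is the COA."""
    return Packet(src=care_of_address, dst=home_agent_address, payload=inner)


ha = HomeAgent("192.0.2.1")
ha.register("192.0.2.17", "198.51.100.5")

from_ch = Packet("203.0.113.9", "192.0.2.17", "hello")
tunneled = ha.forward(from_ch)

reply = Packet("192.0.2.17", "203.0.113.9", "hi")
outer = reverse_tunnel(reply, "198.51.100.5", ha.address)
recovered = ha.decapsulate(outer)
```

The inner packets keep the permanent home address throughout, so the transport layer at both ends sees no change, while the outer headers carry addresses that are valid at the current location.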

One thing to notice is that packets delivered via HA typically travel further through 
the Internet than they would if delivered by the optimal unicast route. Apart from 
increasing the round-trip delay observed by the communicating parties, this also 
affects other users by increasing the overall load on the shared resources of the 
Internet. A proposed mechanism, known as route optimization, attempts to fix this by
sending binding updates, which contain the current COA of the MH, from the HA to the
CH. A CH with enhanced networking software can learn the temporary COA and then
perform the encapsulation itself, sending the packet directly to the mobile host. This 
avoids the overhead of indirect delivery. 
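
A minimal sketch of the binding cache that route optimization assumes at the CH is given below. The class name, the method names and the dictionary-based packet format are hypothetical; the sketch only shows how a cached binding changes the outer destination of a packet.

```python
class CorrespondentHost:
    def __init__(self, address):
        self.address = address
        self.binding_cache = {}  # home address -> care-of address

    def on_binding_update(self, home_address, care_of_address):
        """Record the COA carried in a binding update."""
        self.binding_cache[home_address] = care_of_address

    def send(self, home_address, payload):
        inner = {"src": self.address, "dst": home_address, "payload": payload}
        coa = self.binding_cache.get(home_address)
        if coa is None:
            return inner  # no binding known: the packet travels via the HA
        # Binding known: encapsulate and send directly to the care-of address.
        return {"src": self.address, "dst": coa, "payload": inner}


ch = CorrespondentHost("203.0.113.9")
before = ch.send("192.0.2.17", "hello")
ch.on_binding_update("192.0.2.17", "198.51.100.5")
after = ch.send("192.0.2.17", "hello")
```

Before the binding update the packet is addressed to the home address and is intercepted by the HA; afterwards the CH performs the encapsulation itself and the detour through the home network is avoided.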

1.1.3 Industrial Solutions and Mobile IP 

Mobile IP, or some variant thereof, is a popular solution adopted by the majority of 
industrial products offering IP connectivity to mobile users. 






1.1.3.1 Ricochet

The Ricochet system from Metricom [17] implements a solution for IP mobility that 
is similar to the Mobile IP protocol. However, it is important to point out that
Ricochet was designed more than a decade ago; hence it predates the Mobile IP
protocol. Wireless cells are connected to IP gateways and name servers that provide 
security, authorization and roaming support to users. At any given point in time, a 
user has three addresses: one IP address, which is fixed, and two layer-2 addresses: 
one is fixed and unique to that user, and the other is dynamic and unique to the cell 
where a user is located at that point in time. When a user first connects to the network, 
its request is validated by the local gateway and name server. If authorized, the
gateway provides the user with an IP address that identifies a permanent virtual
connection between the user and the network. All Internet traffic for the user is
tunneled through the gateway to which the user was originally connected. The
gateway maps the IP address of the user to the layer-2 address of the user
corresponding to the cell where the user is located. As the user crosses cells, the IP
address it had acquired from the gateway remains fixed. However, the mapping of this
address to a cell location changes to reflect the new location of the user. In essence,
this gateway performs the function of a home agent in Mobile IP, by providing the user with
an IP address, and tunneling the traffic for that user to its most up-to-date location.

² Note that for IPv6, the extension header plays the role of the encapsulation header in Mobile
IP.

1.1.3.2 UMTS

One example of an industrial system that uses the Mobile IP protocol is the Universal 
Mobile Telecommunication System (UMTS), which is proposed in [10]. UMTS aims 
to provide IP level services via virtual connections between mobile hosts and IP 
gateways connected to ISPs or corporate networks at the edges of the mobile network. 

Users are assigned domain names, which are used to identify the ISP that can be 
accessed to provide Internet connectivity to the user. When a user logs on, it is 
assigned an IP address by the gateway to which that ISP is connected, also known as 
the home gateway. A virtual connection is established, consisting of two segments: 
one segment connects the mobile and some foreign gateway (via the air interface), 
and another connects the foreign gateway and the home gateway (via a protocol
similar to Mobile IP). The virtual connection is maintained as long as the mobile 
remains on and the foreign gateway can be changed as the mobile roams from the 
coverage area of one gateway to another. One can think of the mobile as being linked 
to the home gateway via an elastic global pipe. To the external world, the mobile 
appears to be located at the home gateway because it is this gateway that provides the 
IP address for the mobile. This mobility model is similar to the Mobile IP protocol. 

1.1.4 Other Challenges 

Another challenge of mobile networking is to support multimedia applications with 
stringent performance constraints, such as low packet loss and high interactivity. As 
users move, handoff needs to take place between the user's old point of attachment to
the network and the new point of attachment to the network. Handoff may require 
change of state, not only at routers in the network to which a user is immediately 
connected, but also at routers inside the Internet that deliver packets to that user. If the 
number of such routers is large, or the distance from the user to these nodes is large, 
this change of state can take a long time. An interruption in connectivity due to a slow 
handoff can cause packet loss, which can significantly lower the perceived quality of 
these applications by the user. Packet buffering is typically used to handle packet loss. 
However, packet buffering may result in excessive latency overhead. Real-time 
applications such as voice depend on packets being delivered at a constant rate and 
within a certain time budget. If at times of user movement, the network cannot ensure 
the timely delivery of packets, they become irrelevant and would need to be 
discarded, to the dissatisfaction of users. Therefore, to assist moving users and 
maintain the continuity of multimedia traffic, the solution needs to support fast 
handoffs. To achieve this, handoffs should not involve propagation of information 
over long distances (hence should be handled in the vicinity of the user location).
Also, in order to achieve smooth handoffs, it may be necessary for the mobile user to
be connected to multiple points of attachment (referred to as diversity in the
literature). However, this becomes difficult as speeds increase and users move 
continuously across space, frequently changing their points of attachment. 

Unfortunately, Mobile IP does not meet the challenge of fast handoff. Rather than
attempting to handle rapid network transitions such as the ones encountered in a
wireless cellular system, Mobile IP focuses on the problem of long-duration moves.
Let us refer to movement that requires fast handoffs as micro-mobility. The reason
why Mobile IP does not perform well for micro-mobility should be clear: after it
moves to a new subnet, an MH must detect that it has moved, communicate across the
foreign network to obtain a COA and then communicate across the Internet to its HA
to arrange forwarding. Because it requires considerable overhead after each move,
Mobile IP is intended for situations in which the MH crosses subnets infrequently, e.g.,
when the MH remains at a given location for a relatively long period of time.



2 Accelerating Micro-mobility 

Many researchers have investigated ways to improve Mobile IP by accelerating 
micro-mobility. Subnets in the Internet are grouped into domains. Inter-domain 
mobility is achieved using Mobile IP, while intra-domain mobility is achieved using 
techniques that are particular to each research scheme. Routers or switches inside the 
domain keep track of users and deliver traffic to them using their learning databases. 
To perform this function, these devices effectively implement host-specific routing or 
switching. Traffic between domains is exchanged via routers typically known as 
gateways. The basic idea is shown in Fig. 1. 




Fig. 1. Typical Micro-Mobility Architecture



2.1 HFA

In [1] hierarchical foreign agents (HFAs) are introduced to smooth out the handoff 
process when a mobile host MH transitions between subnets. This optimization is 
accomplished via hierarchical tracking of mobile hosts (MHs) by the foreign agents 
(FAs) and via packet buffering at FAs. 

The FAs of a domain are organized into a tree structure that handles all the 
handoffs in that domain. The tree organization is unspecified and left up to the 
network administrator of that domain. One popular configuration is to have a foreign 
agent associated with the firewall to that domain be the root of the tree (also known as 
a gateway foreign agent or GFA) and all the other foreign agents provide the second 
level of the hierarchy. 

An FA sends advertisements called Agent Advertisements in order to signal its 
presence to the MHs. An Agent Advertisement includes a vector of care-of addresses, 
which are the IP addresses of all its ancestors as well as the IP address of that FA. 
When an MH arrives at an FA, it registers the FA and all its ancestors with its home 
agent HA. The registration is seen and processed by the FA, all its ancestors and the 
HA. 

When a packet for the MH arrives at its home network, the HA tunnels it to the 
GFA. The GFA re-tunnels it to the lower-level FA, which in turn re-tunnels it to the 
next lower level FA. Finally, the lowest-level FA delivers it to the MH. Therefore, an 
FA processing a registration should record the next lower-level FA as the other end of 
the forwarding tunnel. 
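
The registration and re-tunneling steps above can be sketched as a small tree of agents. The class ForeignAgent, its method names and the two-level topology are assumptions made for illustration; the HA and the actual encapsulation headers are omitted.

```python
class ForeignAgent:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.next_hop = {}  # MH -> next lower-level FA (None if the MH is attached here)

    def care_of_vector(self):
        """Addresses advertised by this FA: itself followed by all its ancestors."""
        vector, fa = [], self
        while fa is not None:
            vector.append(fa.name)
            fa = fa.parent
        return vector

    def register(self, mh, child=None):
        """Record the next lower-level FA for mh, then pass the registration upward."""
        self.next_hop[mh] = child
        if self.parent is not None:
            self.parent.register(mh, child=self)

    def deliver(self, mh, path=None):
        """Re-tunnel level by level; returns the list of agents the packet traverses."""
        path = (path or []) + [self.name]
        child = self.next_hop[mh]
        return path if child is None else child.deliver(mh, path)


gfa = ForeignAgent("GFA")
fa1 = ForeignAgent("FA1", parent=gfa)
fa2 = ForeignAgent("FA2", parent=gfa)

fa1.register("MH")
path_before = gfa.deliver("MH")

fa2.register("MH")  # handoff inside the domain
path_after = gfa.deliver("MH")
```

In this sketch a handoff between FA1 and FA2 changes only the entry held at the GFA, which is why the hierarchy can confine handoff processing to the vicinity of the MH.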

Mobile IP route optimization extends the use of binding cache and binding update 
messages to provide smooth handoff via previous FA notification. However, tunneled 
packets that arrive at the previous FA before the previous FA notification are still lost. 
Such data loss may be aggravated if the MH loses contact with any FAs for a 
relatively long period of time. HFA includes an additional FA buffering mechanism. 
Besides decapsulating tunneled packets and delivering them directly to an MH, the 
FA also buffers these packets. When it receives a previous FA notification, it re- 
tunnels the buffered packets along with any future packets tunneled to it. Clearly, how 
much packet loss can be avoided depends on how quickly an MH finds a new FA, and 
how many packets are buffered at the previous FA. This in turn depends on how 
frequently FAs send out beacons or agent advertisements, and how long the MH stays 
out of range of any FA. To reduce duplicates, the MH buffers the identification and 
source address fields in the IP headers of the packets it receives and includes them in 
the buffer handoff request so that the previous FA does not need to retransmit those 
packets that the MH has already received. 
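
The buffering and duplicate-suppression mechanism can be sketched as follows. The buffer capacity, the class name BufferingFA and the dictionary-based packet format are hypothetical; only the use of the (source address, identification) pair to recognise already-received packets is taken from the description above.

```python
from collections import deque


class BufferingFA:
    def __init__(self, capacity=8):
        # Bounded buffer: when it is full, the oldest packet is discarded first.
        self.buffer = deque(maxlen=capacity)

    def deliver(self, packet):
        """Deliver a decapsulated packet to the MH and keep a copy in the buffer."""
        self.buffer.append(packet)
        return packet

    def handoff(self, already_received):
        """Return the buffered packets whose (src, id) pair the MH has not reported."""
        return [p for p in self.buffer if (p["src"], p["id"]) not in already_received]


previous_fa = BufferingFA()
for ident in (1, 2, 3, 4):
    previous_fa.deliver({"src": "203.0.113.9", "id": ident, "data": b"..."})

# The MH reports the packets it actually received before losing contact.
seen = {("203.0.113.9", 1), ("203.0.113.9", 2)}
to_resend = previous_fa.handoff(seen)
```

Only the packets that are missing at the MH are re-tunneled to the new FA, at the cost of buffer space at every FA and of the state the MH must keep about recently received packets.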

While HFA helps reduce the overhead of handoff by handling handoff closer to the 
MH, it adds latency due to the need for packet encapsulation and decapsulation at 
every FA in the FA tree along the path from the CH to the MH. Moreover, scalability 
issues arise at the root FA and the FAs close to the root of the FA tree because of their 
involvement in packet tunneling for all the MHs of that domain. Finally, packet 
buffering results in latency overhead, while encapsulation still generates bandwidth 
overhead.



2.2 Cellular IP




Fig. 2. A Wireless Access Network in Cellular IP. Base stations in an access network are 
interconnected by wired links. One gateway controls each access network. 

Cellular IP access networks, depicted in Fig. 2, are connected to the Internet via 
gateway routers [9]. Cellular IP uses base stations for wireless access connectivity, for 
IP packet routing and for mobility support inside an access network. Base stations 
(BSs) are built on regular IP forwarding engines except that IP routing is replaced by 
Cellular IP routing. MHs attached to an access network use the IP address of the 
gateway as their Mobile IP care-of-address. The gateway de-tunnels packets and 
forwards them toward a BS. Inside a Cellular IP network, MHs are identified by their 
permanent home address and data packets are routed without tunneling or address
conversion. The Cellular IP routing protocol ensures that packets are delivered to the
host's actual location. Conversely, packets sent by the MH are directed to the gateway,
and from there, to the Internet. 

Periodically, the gateway sends out beacons that are broadcast across the access
network. Through this procedure, BSs learn about neighbouring BSs on the path 
towards the gateway. They use this information when forwarding packets to the 
gateway. Moreover, when forwarding data packets from users to the gateway, BSs 
learn about the location of a user, and use that information to deliver packets sent for 
that user. 
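
The way BSs learn downlink routes from uplink traffic can be sketched as follows. The class BaseStation, the AIR marker and the three-node topology are assumptions for illustration; cache timeouts and paging are not modelled.

```python
AIR = "air"  # marker meaning "the MH is reached over this BS's radio interface"


class BaseStation:
    def __init__(self, name, uplink=None):
        self.name = name
        self.uplink = uplink    # neighbour toward the gateway, learned from beacons
        self.route_cache = {}   # MH -> neighbour from which its last uplink packet came

    def forward_uplink(self, mh, came_from):
        """Forward a packet from mh toward the gateway, learning the reverse path."""
        self.route_cache[mh] = came_from
        if self.uplink is not None:
            self.uplink.forward_uplink(mh, came_from=self)

    def forward_downlink(self, mh):
        """Follow the cached entries; returns the nodes traversed, or None if unknown."""
        nxt = self.route_cache.get(mh)
        if nxt is None:
            return None  # unknown host: a real BS would start paging here
        if nxt == AIR:
            return [self.name]
        return [self.name] + nxt.forward_downlink(mh)


gateway = BaseStation("GW")
bs1 = BaseStation("BS1", uplink=gateway)
bs2 = BaseStation("BS2", uplink=gateway)

bs1.forward_uplink("MH", AIR)
route_before = gateway.forward_downlink("MH")

bs2.forward_uplink("MH", AIR)  # the MH now transmits through BS2
route_after = gateway.forward_downlink("MH")
```

No explicit signalling is needed while the MH is sending data: each uplink packet refreshes the entries along its path, and the downlink route is simply the reverse of that path.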

If a packet is received at a BS for a user that is unknown to that BS, a paging 
request is initiated by the BS. The paging request is broadcast across a limited area
in the access network called a paging area. The MH responds to the paging request 
and its route to the paging BS gets established. Each MH needs to register with a 
paging area when it first enters that area, regardless of whether it is engaged in 
communication or idle. Clearly, how fast paging occurs depends on the size of the
paging area and on the efficiency of spanning tree traversal. A small paging area can
help reduce the latency of paging; however, it increases the number of paging areas
required to cover a given area, which in turn increases the signalling overhead
imposed on MHs.

We observe that the paging techniques in Cellular IP are similar to those used in
the Groupe Spécial Mobile (GSM) system [13]. Mobile users are located in
system-defined areas called cells that are grouped in paging areas. Every user connects with
the base station in his cell through the wireless medium. Base stations in a given 
paging area are connected by a fixed wired network to a switching center, and 
exchange data to perform call setups and deliver calls between different cells. When a 
call arrives at the switching center for a given user, a paging request for that user is 
initiated across all the cells in that paging area. If the user answers, a security check 
on the user is performed, and if the test passes, the switching center sets up a 
connection for that user. 

Cellular IP supports two types of handoff: hard handoff and semisoft handoff. MHs 
listen to beacons transmitted by BSs and initiate handoff based on signal strength 
measurements. To perform a handoff, the MH tunes its radio to the new BS and sends 
a registration message that is used to create routing entries along the path to the 
gateway. Packets that are received at a BS prior to the location update are lost. Just 
like in Mobile IP, packet loss can be reduced by notifying the old BS of the pending 
handoff, and requesting that the old BS forward those packets to the new BS. Another 
possibility is to allow for the old route to remain valid until the handoff is established. 
This is known as semisoft handoff and is initiated by the MH sending a semisoft 
handoff packet to the new BS while still listening to the old BS. After a semisoft 
delay, the MH sends a regular handoff packet. The purpose of the semisoft packet is 
to establish parts of the new route (to some uplink BS). During the semisoft delay 
time, the MH may be receiving packets from both BSs. The success of this scheme in 
minimizing packet loss depends on both the network topology and the value of the 
semisoft delay. While a large value can eliminate packet loss, it adds a burden
on the wireless network by consuming precious bandwidth.

Cellular IP specifies an algorithm to build a single spanning tree rooted at the 
gateway to the access network, as described above. A spanning tree is necessary
for the broadcasting of packets, to prevent packets from circulating indefinitely if the
topology of the access network has any loops. However, because it uses only a subset
of the links inside the access network, a single spanning tree can result in link
overload if traffic in the access network is high. This can be a significant drawback of
Cellular IP as high-density access networks supporting many Tb/s of traffic become
possible to deploy. Moreover, a single spanning tree can be prone to long periods
of connectivity loss. Connectivity loss would make this technology unacceptable as a
replacement for wired, circuit-switched technology for telephone communications.
Finally, Cellular IP specifies an interconnect between base stations that has a flat 
hierarchy. As access networks cover more area and exhibit higher pico-cell densities, 
a flat hierarchy would result in latencies of packet traversal across the access network 
that are unacceptable. 
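
The observation that a single spanning tree leaves part of the wired links idle can be checked on a small example. The sketch below builds a breadth-first spanning tree rooted at the gateway; the topology and the node names are invented for illustration, and breadth-first search is used here only as one simple way to obtain a spanning tree, not as the algorithm specified by Cellular IP.

```python
from collections import deque


def spanning_tree(links, root):
    """Return the set of (parent, child) links of a breadth-first tree rooted at root."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return {(p, n) for n, p in parent.items() if p is not None}


# Hypothetical access network with loops: one gateway and four base stations.
links = [("GW", "A"), ("GW", "B"), ("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
tree = spanning_tree(links, "GW")
unused = len(links) - len(tree)
print(len(tree), unused)
```

A tree over five nodes always has four links, so two of the six links in this example carry no traffic, and all traffic to and from the Internet concentrates on the links adjacent to the gateway.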

The description of Cellular IP assumes that, originally, each wireless cell (or even
pico-cell) constitutes an IP subnet. Consequently, its authors propose that multiple wireless
cells be grouped into one subnet to improve roaming between the cells of one subnet.
However, this concept is not new. For example, the 802.11 standard uses Extended
Service Sets (ESS) to interconnect multiple 802.11 cells within a single subnet.
Cellular IP also proposes two protocols for configuration and routing in IP subnets;
however, LAN protocols already exist to accomplish these goals. For example, the
algorithms for building a spanning tree and for learning as defined by the 802
standards are widely deployed and well known.

Nonetheless, it is clear that deploying wireless access networks as single subnets,
as in Cellular IP, is important for mobility. In this light, it becomes important to
increase the size of IP subnets to the largest size possible in order to maximize their 
effectiveness in supporting IP mobility. 



2.3 Hawaii 




Fig. 3. Diagram of a Domain in the Hawaii Architecture. A domain root router acts as the 
gateway to each domain. Paths are established between the routers of a domain. 

HAWAII segregates the network into a hierarchy of domains, loosely modeled on 
the autonomous system hierarchy used in the Internet [14]. The gateway into each 
domain is called the domain root router. When moving inside a foreign domain, an 
MH retains its COA unchanged and connectivity is made possible via dynamically 
established paths, as shown in Fig. 3. Path-setup update messages are used to 
establish and update host-based routing entries for the mobile hosts in selective 
routers in the domain, so that packets arriving at the domain root router can reach the 
mobile host. The choice of when, how and which routers are updated constitutes a 
particular setup scheme. HAWAII describes four such setup schemes, which trade off
efficiency of packet delivery and packet loss during handoff. The MH sends a path 
setup message, which establishes host specific routes for that MH at the domain root 
router and any intermediary routers on the path towards the mobile host. Other routers 
in the domain have no knowledge of that MH's IP address. Moreover, the home agent 
and communicating host are unaware of intra-domain mobility. The state maintained
at the routers is soft: the MH infrequently sends periodic refresh messages to the local
BS. In turn, the BS and intermediary routers send periodic aggregate hop-by-hop 
refresh messages toward the domain root router. Furthermore, reliability is achieved 
through maintaining soft-state forwarding entries for the mobile hosts and leveraging 
fault detection mechanisms built in existing intra-domain routing protocols. 
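
The soft-state behaviour described above can be sketched as a forwarding table whose entries expire unless they are refreshed. The class SoftStateTable, the lifetime value and the interface names are hypothetical, and time is passed in explicitly so that the example is deterministic.

```python
class SoftStateTable:
    def __init__(self, lifetime):
        self.lifetime = lifetime
        self.entries = {}  # host address -> (outgoing interface, expiry time)

    def refresh(self, host, interface, now):
        """Install or renew the host-specific entry (path setup or refresh message)."""
        self.entries[host] = (interface, now + self.lifetime)

    def lookup(self, host, now):
        """Return the outgoing interface, or None if the entry is absent or expired."""
        entry = self.entries.get(host)
        if entry is None or entry[1] <= now:
            self.entries.pop(host, None)  # expired state is removed silently
            return None
        return entry[0]


table = SoftStateTable(lifetime=30)
table.refresh("MH", "if1", now=0)
first = table.lookup("MH", now=10)

table.refresh("MH", "if2", now=20)   # handoff: the refresh arrives on another path
second = table.lookup("MH", now=45)
third = table.lookup("MH", now=60)   # no refresh since t=20, so the entry has expired
```

Because stale entries disappear on their own, a router that misses an update or restarts recovers without explicit teardown messages, at the price of periodic refresh traffic.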

HAWAII exploits host-specific routing to deliver micro-mobility. By design, 
routers perform prefix routing to allow for a large number of hosts to be supported in 
the Internet. While routing based on host-specific addresses can also be performed at 
a router, it is normally discouraged, because it violates the principle of prefix routing. 
Furthermore, host-specific routing is limited by the small number of host-specific 
entries that can be supported in a given router. However, this concern can be 
addressed by appropriate sizing of the domain and by carefully choosing the routers 
that are updated when a mobile is handed off. One of the problems with the
implementation of HAWAII is that a single domain root router is used. This router, as
well as its neighbors inside the routing tree, can become bottleneck routers for the
domain for two reasons. First, they hold routing entries for all the users inside the
domain. Second, they participate in the handling of all control and data packets for
that domain. Another disadvantage of HAWAII comes from its use of routers as a
foundation for micro-mobility support. With cells becoming smaller, it is possible that
a larger number of routers would be needed for user tracking and routing in a given
area; however, this can become prohibitively expensive.



2.4 Multicast-Based Mobility 

Numerous multicast-based mobility solutions have been proposed [2,6,7]. In [2,6], 
each mobile host is assigned a unique multicast address. Routers in the neighborhood 
of the user join this multicast address, and thus form a multicast tree for that address. 
Packets sent to the mobile host are destined to that multicast address and flow down 
the multicast distribution tree to the mobile host. In [7], packets are tunneled from the 
home agent using a pre-arranged multicast group address, to which a set of neighboring
base stations in the vicinity of the mobile host subscribe. The most significant drawback
of these solutions is that they require routers to be multicast capable; this capability 
does not exist in the Internet routers of today and would need to be added. In essence, 
this solution requires that routers learn multicast addresses, in the same way that 
routers learn unicast addresses in the other schemes for micro-mobility that we 
discussed. Unlike LAN switches, routers are not designed to learn host addresses, and 
therefore they would need to be modified for this purpose. Other drawbacks of 
mobility schemes based on multicast routing are that they require unique multicast 
addresses to be used, which creates address management complexity and limits the 
addressing space. 



2.5 Micro-mobility and LAN Switching 

In all the solutions we presented, fixed IP addresses are used to track mobile users 
inside a domain. This is done via learning at base stations, routers or agents. Despite 
the use of IP addresses, which are hierarchical, the addressing structure within a 
domain becomes non-hierarchical, just like in a LAN. Consequently, these addresses
are tracked in the same fashion as layer-2 addresses in LANs. We make the
observation that, in fact, these addresses are tracked in the same way as virtual
channel identifiers in circuit-switched solutions such as ATM (employed in UMTS
for the tracking of users by foreign gateways). In their original design, routers were
not intended for performing tracking of individual host addresses, and consequently
do not perform host-specific routing in an efficient way. It is unlikely that router
designs will be modified for this purpose. By design, layer-2 switches track host
addresses, and hence represent a more suitable solution for mobile tracking inside a
domain.
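
The address learning performed by layer-2 switches, on which this argument rests, can be sketched in a few lines. The class LearningSwitch and the port numbers are illustrative assumptions; ageing of entries and the spanning tree protocol are left out.

```python
class LearningSwitch:
    def __init__(self, ports):
        self.ports = set(ports)
        self.table = {}  # MAC address -> port on which it was last seen

    def receive(self, src, dst, in_port):
        """Learn the source location, then return the set of ports to forward on."""
        self.table[src] = in_port  # learn, or relearn after the host has moved
        out = self.table.get(dst)
        if out is None:
            return self.ports - {in_port}  # unknown destination: flood
        return set() if out == in_port else {out}


switch = LearningSwitch(ports=[1, 2, 3])
flooded = switch.receive("mh", "ch", in_port=1)     # "ch" not yet known
direct = switch.receive("ch", "mh", in_port=2)      # "mh" was learned on port 1

switch.receive("mh", "ch", in_port=3)               # the mobile now appears on port 3
after_move = switch.receive("ch", "mh", in_port=2)
```

A single frame sent by the mobile from its new location is enough to redirect traffic toward it, without any change of IP address and without involving routers.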



3 A Multi-layer Infrastructure for Mobility 

Our view is that an architecture to handle mobility must operate in a hierarchical 
fashion by providing functionality at multiple layers; namely, the MAC layer, the 
networking layer and DNS (or the “directory” layer). Each layer is suited for 
implementing mobility if specific circumstances are met. The MAC layer is ideal for 
delivering fine grain mobility inside homogeneous networks, by virtue of the fast, 
cost-effective switching technologies and the address learning schemes available at 
this layer. Similarly, the networking layer is best suited for implementing coarse grain 
mobility in cases where mobiles cross subnets and hence require new IP addresses to 
remain reachable, or when movement happens across heterogeneous networks where 
MAC layer addresses are incompatible. DNS can further support coarse grain 
mobility by maintaining an up-to-date directory of users and their IP addresses which 
can be used to simplify the operation of the networking layer. This architecture
is shown in Fig. 4. In the following subsections, we provide a more
detailed description of the architecture. Readers who are interested in a complete 
description of the architecture are referred to [11]. 



3.1 Extended LAN 

Over the past decade, we have witnessed tremendous developments in LAN 
technologies, such as increases in switch processing by a few orders of magnitude, 
and increases in link bandwidth and distances (owing to fiber-optic technology).
These advances resulted in an increase in the size of LANs, and more recently, their 
deployment in metropolitan areas. We observed that such networks are well suited for 
providing mobility to portable IP devices. First, as mentioned earlier, mobile users 
can roam inside extended LANs without having to update their IP addresses. 
Secondly, the learning protocol implemented by LAN switches can be used to support 
diversity and adaptive routing. For these reasons, extended LANs are at the 
foundation of our network design for mobility. 

Given the importance of LANs for mobility, it becomes paramount to answer the 
following questions: 

1. How scalable are extended LANs in terms of number of users, user speed, 
application bandwidth and latency constraints? 

2. What is the appropriate LAN structure and how large an area can it serve? 

3. What is the protocol for tracking users in the LAN? How reliable is the LAN and 
what is its reconfiguration algorithm? 

The answers to these questions need to take into account a variety of issues related 
to the wireless access networks, wired infrastructure, application traffic and 
requirements and user mobility. In order to minimize cost and maximize performance 
and reliability, the network design has to balance many parameters such as: 
processing power and storage capacity in the LAN switches, bandwidth across the 
wired links in the extended LAN, bandwidth and power consumption in the wireless 
cells. While a large extended LAN reduces the need for global mobility, which can be 
inefficient, it also requires that the LAN support a larger number of users, which in 
turn increases the bandwidth requirements at the LAN switches and in the 
wireless cells in order to carry handoff control messages and user data. 




3.2 Dynamic DNS 

At the highest level of the protocol stack, dynamic DNS can be employed to track 
moving users as they change domains. The idea here is that DNS can behave like a 
directory that stores up-to-date, coarse-grain information on the location of mobile 
users. When a user enters a new domain and receives a new IP address, a DNS 
update is sent so that the mobile's domain name maps to its current IP address. 
Sessions that start after the transition to the new domain can then use the latest 
information and locate the user directly. However, sessions that are ongoing at the 
time of the move cannot benefit from the directory update, because DNS lookups are 
not performed in the middle of a session to renew connectivity. A network layer 
solution needs to be devised for this purpose. By updating the DNS database, we 
minimize the need for network layer mobility support and allow efficient routing for 
sessions that perform domain name resolution before session establishment and that 
start after an inter-subnet crossing. 

For dynamic DNS to work properly, the caches that store the mapping of domain 
names to IP addresses at the communicating nodes need to be one of the following: 

1. Binding caches, which guarantee that the latest mapping is returned 

2. Disabled 

3. Configured with a low TTL value (under a few seconds) 

The first choice may give better IP address lookup performance, particularly for slow 
mobility; however, it can be expensive to maintain when mobility is fast and there are 
many users. The second option removes the cost of updating caches at the expense of 
lookup performance. The third option is a compromise between the first two options. 
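
The third option can be illustrated with a minimal sketch of a cache whose entries 
expire after a short TTL. The class name, the injected clock, and the example host name 
and address are all hypothetical and are not part of the architecture described here; 
the sketch only shows why a low TTL bounds how long a stale mapping can be served. 

```python
from typing import Callable, Dict, Optional, Tuple


class TtlDnsCache:
    """Toy name-to-address cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float]) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injected so the example is deterministic
        self._entries: Dict[str, Tuple[str, float]] = {}

    def store(self, name: str, address: str) -> None:
        """Cache a mapping together with its expiry time."""
        self._entries[name] = (address, self.clock() + self.ttl_seconds)

    def lookup(self, name: str) -> Optional[str]:
        """Return the cached address, or None if it is missing or expired."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[name]
            return None  # the caller must resolve the name again
        return address


# Hypothetical usage: a 2-second TTL and a manually advanced clock.
now = [0.0]
cache = TtlDnsCache(ttl_seconds=2.0, clock=lambda: now[0])
cache.store("mobile.example.org", "10.0.1.5")
fresh = cache.lookup("mobile.example.org")   # still within the TTL
now[0] = 3.0
stale = cache.lookup("mobile.example.org")   # expired, forces a new lookup
```

With a TTL of a few seconds, a correspondent can hold an outdated address for at most 
that long after the mobile changes domains, at the cost of more frequent lookups. 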

It is interesting to note the parallels that exist between dynamic DNS solutions and 
directory techniques in cellular telephone systems such as the Groupe Spécial 
Mobile (GSM) or the Personal Communication System (PCS) [13,15]. In the GSM 
and PCS systems, users are identified by a unique phone number. One important 
feature of the GSM/PCS system is the automatic, worldwide localization of users. The 
system knows where a user currently is, and the same phone number is valid 
worldwide. A hierarchy of databases consisting of Home Location Registers (HLRs) 
and Visitor Location Registers (VLRs) is used to track users. The HLR contains 
information about the current VLR of a user, and the VLR knows the switching center 
via which the user can be reached. The HLR/VLR databases are similar to the DNS 
directories in our solution because they store up-to-date information on the location of 
every user. Notice that this feature of GSM/PCS makes mobility management 
a challenging problem. While this approach eliminates system-wide paging, 
which vastly reduces the radio link signaling, it introduces remote database lookups 
that may incur a large amount of wired network traffic and long call setup delay. 
Much effort has been focused on exploring efficient location management techniques. 
Extensions to standard HLR/VLR schemes, such as partial replication and caching 
have been developed to improve wireless call setup performance [15]. We believe that 
these techniques may apply to our solution in order to achieve an efficient and 
scalable dynamic DNS implementation. 



3.3 IP Mobility 

The network layer is important for providing mobility when users roam between 
different administrative domains, different subnets within the same domain, and 
possibly between heterogeneous networks. The network layer solution has two 
components that should be used in combination in order to deliver wide-area mobility. 
One component requires that a tunneling protocol, such as Mobile IP, be used to 
redirect sessions that are ongoing at the time of a move between different domains. A 
second component of the solution is using host-specific entries at the routers of a 
domain to track groups of mobile users inside that domain, as is done in HAWAII. 
The first solution is important in order to eliminate the problem with DNS-based 
tracking that we outlined. The latter solution is important in order to improve the 
efficiency of roaming by giving users the ability to use a single IP address inside a 
large domain that extends beyond one subnet. To support a large number of users, 
routers that implement host-specific routing inside a domain should interconnect via a 
scalable fabric, and implement a scalable routing protocol. 



4 A Case Study 

A scalable LAN overlay is proposed to support mobile users in a metropolitan area 
network. A detailed description of this proposal can be found in [12]. As shown in 
Fig. 4, the extended LAN is implemented as a grid topology (e.g. the Manhattan 
Street Network). This is because the grid matches the topology of cities themselves 
(with the streets being rows and columns), but also because the grid is scalable by 
virtue of its distributed nature. Wireless cells are connected to LAN switches in a 
hierarchical fashion. The hierarchy reduces the number of hops to be traversed when 
communicating between two access points in the grid, and therefore reduces latency. 
To connect to the Internet backbone, a scalable and distributed gateway router is 
necessary. The router needs to scale to support the aggregate traffic to and from all 
the cells in the LAN. For a large number of cells with many users, this bandwidth can 
become very large. For example, for a LAN supporting 2 million users, consuming 2 
Mb/s each, the routing bandwidth is 4 Tb/s. Furthermore, the router must be 
physically distributed across many smaller routers to allow for load balancing at the 
links connecting the LAN switches to the subnet router. 
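
The gateway bandwidth figure above follows from a simple product, which the sketch 
below reproduces. The per-router capacity used in the second step is a hypothetical 
value chosen only to illustrate how the load might be spread over several physical 
routers; it is not taken from the proposal. 

```python
def aggregate_bandwidth_bps(users: int, per_user_bps: int) -> int:
    """Total gateway bandwidth if every user is active at the same rate."""
    return users * per_user_bps


def routers_needed(total_bps: int, per_router_bps: int) -> int:
    """Smallest number of equal-capacity routers that can carry total_bps."""
    return -(-total_bps // per_router_bps)  # ceiling division on integers


# 2 million users consuming 2 Mb/s each, as in the example in the text.
total = aggregate_bandwidth_bps(2_000_000, 2_000_000)
print(total / 10**12, "Tb/s")

# HYPOTHETICAL capacity of 100 Gb/s per physical router, for illustration only.
print(routers_needed(total, 100 * 10**9), "routers")
```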

A protocol is designed for the Manhattan Street Network that takes advantage of 
multiple links in the network and that balances the traffic load across all the links and 
switches in the LAN. The protocol works by partitioning switches into control and 
data partitions. Each control partition must have one or more switches in common 
with every data partition. Similarly, each data partition must have one or more 
switches in common with every control partition. For example, each row in the grid 
could be a control partition and each column, a data partition. A protocol similar to 
the Generic Attribute Registration Protocol (GARP) [16] is used to track users inside 
a given control partition according to the user location. Data packets for a given user 
are propagated along a given data partition (as given by the location where the data 
packet was first injected into the network) until the control partition for that user is 
reached and the packet delivered to the user. 
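
A minimal sketch of this partitioning idea follows, assuming that each row is a control 
partition and each column a data partition. It is not the protocol of [12] and does not 
model GARP: the class `GridLan`, the wrap-around walk along a column, and the absence of 
link failures are all simplifying assumptions made for illustration. 

```python
from typing import Dict, List, Tuple

Switch = Tuple[int, int]  # (row, column) position in the grid


class GridLan:
    """Toy grid LAN: rows act as control partitions, columns as data partitions."""

    def __init__(self, rows: int, cols: int) -> None:
        self.rows = rows
        self.cols = cols
        # Per-switch learning table: user -> switch where the user is attached.
        self.tables: Dict[Switch, Dict[str, Switch]] = {
            (r, c): {} for r in range(rows) for c in range(cols)
        }

    def register(self, user: str, attached_to: Switch) -> None:
        """Record the user's location in the switches of its row only."""
        for table in self.tables.values():
            table.pop(user, None)  # forget any previous location
        row = attached_to[0]
        for c in range(self.cols):
            self.tables[(row, c)][user] = attached_to

    def route(self, user: str, injected_at: Switch) -> List[Switch]:
        """Follow the injection column until a switch knows the user, then its row."""
        r, c = injected_at
        path = [(r, c)]
        for _ in range(self.rows - 1):
            if user in self.tables[(r, c)]:
                break
            r = (r + 1) % self.rows  # assumed wrap-around along the column
            path.append((r, c))
        if user not in self.tables[(r, c)]:
            raise LookupError(f"{user} is not registered in the LAN")
        target_col = self.tables[(r, c)][user][1]
        while c != target_col:
            c += 1 if target_col > c else -1
            path.append((r, c))
        return path


lan = GridLan(rows=4, cols=4)
lan.register("alice", (2, 3))
path = lan.route("alice", (0, 1))
```

Because every column intersects every row, a packet injected anywhere reaches a switch 
that knows the user after traversing at most one column, while a location update 
touches only the switches of a single row. 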

This LAN design has a number of advantages. The LAN does not rely on a single 
spanning tree or root switch. This is important for scalability as the LAN extends to 
large geographical scopes. By exploiting control and data partitions, it minimizes the 
latency of user location updates without affecting the latency of packet routing inside 
the LAN. Finally, its operation relies on existing LAN switching techniques and 
protocols, which makes the solution simple, inexpensive and easy to deploy. 



5 Migration Path 

To transition to the mobile network of tomorrow, it may not be possible to design the 
supporting network infrastructure from scratch. Instead, support for mobility may 
need to be built on existing network structures, such as small subnets controlled by 
LAN switches and interconnected by IP routers with a small number of host-specific 
entries. Under such circumstances, one possibility is the use of Virtual Private 
Networks (VPN) to offer extended LAN connectivity across multiple small subnets. 
In order to support mobile users in the most effective fashion, the protocol to handle 
mobile users needs to be flexible enough to operate at different layers in the protocol 
stack, and versatile enough not to require changes in the implementation of the LAN 
switches and IP routers of that network. In particular, the protocols running on LAN 
switches should be based on existing 802 protocols, since they are implemented in 
hardware and therefore cannot be easily replaced or reprogrammed. The main 
challenge becomes how to use and optimize existing protocols for the purpose of 
efficient support for mobility. 



6 Conclusions 

This paper surveys the state-of-the-art in providing mobility support to mobile users 
in the Internet. In particular, emphasis is placed on micro-mobility techniques 
designed to accelerate Mobile IP. One observation is that all micro-mobility techniques 
work in a similar way by requiring that network devices inside a given geographical 
area learn about the location of users and keep track of them as they move inside that 
area. The differences among these techniques are the type of device required to do the 
learning (it could be an IP router, Mobile IP agent or LAN switch) and the protocols for 
routing packets using the learning databases. This paper also presents an architecture 
for mobility, which exploits extended LANs, IP routing and dynamic DNS. One 
important feature of this architecture is its scalable and efficient LAN design, geared 
at optimizing IP mobility. By relying on existing technologies, and by virtue of 
working with Mobile IP, this architecture is also global, cost-effective, easily 
deployable and compatible with the Internet of today. 



Abstract. IP multicast suffers from a scalability problem as the number of concurrently 
active multicast groups grows, and the scalability of QoS multicast is even further 
from being solved. In this paper, we propose an approach to reduce multicast 
forwarding state and to provision multicast with QoS guarantees. In our approach, 
multiple groups are forced to share a single delivery tree. We discuss the advantages 
and some implementation issues of our approach, and conclude that it is 
feasible and promising. We then describe how to use our approach to provision 
scalable QoS multicast. Finally, we define metrics to quantify state reduction and 
use simulations to show how our scheme achieves state reduction. These initial 
simulation results suggest that our method can reduce multicast state significantly. 



1 Introduction 

Multicast state scalability is the problem we address in this work. Multicast is a mech- 
anism to efficiently support multi-point communications. IP multicast utilizes a tree 
delivery structure, on which data packets are duplicated only at fork nodes and are for- 
warded only once over each link. This approach makes IP multicast resource-efficient 
in delivering data to a group of members simultaneously and can scale well to support 
very large multicast groups. However, even after approximately 20 years of multicast 
research and engineering effort, IP multicast is still far from being as common-place as 
the Internet itself. 

Multicast state scalability is among the technical difficulties that delay the deployment 
of IP multicast. A multicast distribution tree requires all tree nodes to maintain 
per-group (or even per-group/source) forwarding state, which grows at least linearly 
with the number of “passing-by” groups. As multicast gains widespread use and the number 
of concurrently active groups grows, more and more forwarding state entries will be 
needed. More forwarding entries translate into larger memory requirements, and may also 
lead to a slower forwarding process, since every packet forwarding involves an address 
look-up. In QoS multicast, the problem becomes even worse, because not only routes but 
also resources (e.g., bandwidth) must be maintained for each individual multicast group. 
This is perhaps the main scalability problem with IP multicast and QoS multicast 
provisioning when the number of simultaneous on-going multicast sessions is very large. 

Recently, much research effort has focused on the problem of multicast state scalability. 
Some schemes attempt to reduce forwarding state by tunneling [?] or by forwarding 
state aggregation [?]. Thaler and Handley analyze the aggregatability of forwarding 
state in [?] using an input/output filter model of multicast forwarding. Radoslavov et 
al. propose algorithms to aggregate forwarding state and study the bandwidth-memory 
tradeoff with simulation in [?]. Both of these works attempt to aggregate routing state 
after it has been allocated to groups. Other architectures aim to completely 
eliminate multicast state at routers using network-transparent multicast, which 
pushes the complexity to the end-points. 

Though most research papers on QoS multicast focus on solving a theoretical 
constrained multicast routing problem, there have been efforts to bring QoS into the 
existing IP multicast architecture, such as RSVP [?], QoSMIC [?], the QoS extension to 
CBT [?], and the PIM-SM QoS extension [?]. But all these schemes use per-flow state, 
keeping track of route and resource information for each individual group, and 
therefore suffer from the scalability problem mentioned above. 

In this paper, we propose a novel scheme to reduce multicast state and provision 
scalable QoS multicast, which we call aggregated multicast. Our approach differs from 
previous ones in that we force multiple multicast groups to share one distribution tree, 
which we call an aggregated tree. This way the total number of trees in the network, 
and thus the forwarding state, may be significantly reduced; core routers only need to 
keep state per aggregated tree instead of per group. In this paper we examine several 
design and implementation issues of our scheme and describe how to use the aggregated 
multicast scheme to provision multicast with QoS guarantees. We also present results 
from our initial simulation experiments, in which our scheme achieves significant state 
reduction in the worst-case scenario where group members have no spatial locality at all. 

The rest of this paper is organized as follows. Section 2 introduces the concept 
of the aggregated multicast approach and discusses some implementation-related issues. 
Section 3 discusses QoS provisioning on the aggregated tree. Section 4 proposes metrics 
to quantify multicast state reduction in aggregated multicast and presents simulation 
results. Section 5 gives a short summary of our work. 

2 Aggregated Multicast 

Aggregated multicast is targeted as an intra-domain multicast provisioning mechanism 
in the transport network. For example, it can be used by an ISP (Internet Service Provider) 
to provide multi-point data delivery service for its customers and peering neighbors in its 
wide-area or regional backbone network (which can be just a single domain). The key idea 
of aggregated multicast is that, instead of constructing a tree for each individual multicast 
session in the core network (backbone), one can have multiple multicast sessions share a 
single aggregated tree to reduce multicast state and, correspondingly, tree maintenance 
overhead at the network core. 

2.1 Concept 

Fig. 1 illustrates a hierarchical inter-domain network peering. Domain A is a regional 
or national ISP’s backbone network, and domains D, X, and Y are customer networks of 
domain A at a certain location (say, Los Angeles). Domains B and C can be other customer 
networks (say, in New York) or some other ISP’s networks that peer with A. A multicast 
session originates at domain D and has members in domains B and C. Routers D1, A1, 
A2, A3, B1 and C1 form the multicast tree at the inter-domain level, while A1, A2, A3, 
Aa and Ab form an intra-domain sub-tree within domain A (there may be other routers 
involved in domains B and C). The sub-tree can be a PIM-SM shared tree rooted at an RP 
(Rendezvous Point) router (say, Aa), a bi-directional shared CBT (Center-Based Tree) 
centered at Aa, or an MOSPF tree. Here we will not go into the details of the 
intra-domain multicast routing protocol, and just assume that the traffic injected into 
router A1 by router D1 will be distributed over that intra-domain tree and reach routers 
A2 and A3. 




Fig. 1. Domain peering and a cross-domain multicast tree, tree nodes: D1, A1, Aa, Ab, A2, B1, 
A3, C1, covering group G0 (D1, B1, C1). 



Consider a second multicast session that originates at domain D and also has members 
in domains B and C. For this session, a sub-tree with exactly the same set of nodes will 
be established to carry its traffic within domain A. Now if there is a third multicast 
session that originates at domain X and also has members in domains B and C, then 
router X1 instead of D1 will be involved, but the sub-tree within domain A still involves 
the same set of nodes: A1, A2, A3, Aa, and Ab. To facilitate our discussion, we make 
some distinctions among these nodes. We call node A1, at which external traffic is 
injected, a source node; nodes A2 and A3, which distribute multicast traffic to other 
networks, exit nodes; and nodes Aa and Ab, which transport traffic in between, transit 
nodes. In a bi-directional inter-domain multicast tree, a node can be both a source node 
and an exit node. Source nodes and exit nodes together are called terminal nodes. In the 
terminology commonly used in DiffServ [?], terminal nodes are often edge routers and 
transit nodes are often core routers in a network. 

In conventional IP multicast, all the nodes in the above example that are involved 
within domain A must maintain separate state for each of the three groups individually, 
even though their multicast trees have the same “shape”. Alternatively, in an aggregated 
multicast approach, one can set up a pre-defined tree (or establish one on demand) that 
covers nodes A1, A2 and A3 using a single multicast group address (within domain A). 
This tree is called an aggregated tree (AT), and it is shared by all multicast groups 
that are covered by it and are assigned to it. We say an aggregated tree T covers a 
group G if all terminal nodes for G are member nodes of T. Data from a specific group is 
encapsulated at the source node. It is then distributed over the aggregated tree and 
decapsulated at the exit nodes to be further distributed to neighboring networks. This 
way, transit routers Aa and Ab only need to maintain a single forwarding entry for the 
aggregated tree, regardless of how many groups are sharing it. 
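
The encapsulation step can be sketched as follows. This is only an illustration under 
stated assumptions: the tree identifier `AT1`, the group names, and the next hops listed 
for transit router Aa are hypothetical (the exact links of Fig. 1 are not given in the 
text), and no real packet format is implied. 

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Packet:
    group: str      # original multicast group address
    payload: bytes


@dataclass(frozen=True)
class Encapsulated:
    tree_id: str    # identifies the aggregated tree inside the domain
    inner: Packet   # original packet, untouched


def encapsulate(packet: Packet, group_to_tree: Dict[str, str]) -> Encapsulated:
    """Source node: wrap the packet with the tree assigned to its group."""
    return Encapsulated(group_to_tree[packet.group], packet)


def transit_forward(pkt: Encapsulated, tree_table: Dict[str, List[str]]) -> List[str]:
    """Transit node: forward on the tree identifier only, never on the group."""
    return tree_table[pkt.tree_id]


def decapsulate(pkt: Encapsulated) -> Packet:
    """Exit node: recover the original packet and its group address."""
    return pkt.inner


# Hypothetical state: three groups assigned to one aggregated tree.
group_to_tree = {"G0": "AT1", "G1": "AT1", "G2": "AT1"}
aa_table = {"AT1": ["A2", "Ab"]}  # assumed next hops at transit router Aa

hops = [
    transit_forward(encapsulate(Packet(g, b"data"), group_to_tree), aa_table)
    for g in ("G0", "G1", "G2")
]
```

In this sketch the table at Aa holds one entry however many groups map to `AT1`, while 
the per-group mapping lives only at the source node. 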

2.2 Implementation Considerations 

It is not our goal to provide protocol details for aggregated multicast in this paper. 
However, a high-level overview of how it can be implemented in practice will provide 
a reality check which helps validate our work and provides some insights regarding its 
advantages and drawbacks. 

First of all, there are various options for distributing multicast traffic of different 
groups over a shared aggregated tree. Regardless of the implementation, there are two 
basic requirements: (1) the original group address of data packets must be stored 
somewhere and be recoverable by exit nodes, which use it to determine how to forward 
these packets in the access network; and (2) some kind of identification for the 
aggregated tree that the group is using must be carried in the packet header and used 
by transit nodes to forward the packet. One possibility is to use IP encapsulation as 
described above, which, of course, adds complexity and processing overhead (at terminal 
nodes). A more efficient solution is MPLS (Multiprotocol Label Switching) [?], in which 
labels can identify different aggregated trees. 

To handle aggregated tree management and the matching between multicast groups and 
aggregated trees, a centralized management entity called the tree manager is introduced. 
A tree manager has knowledge of the established aggregated trees in the network and is 
responsible for establishing new ones when necessary. It collects (inter-domain) group 
join messages received by border routers and assigns aggregated trees to groups. Once 
it determines which aggregated tree to use for a group, the tree manager can install 
the corresponding state at the edge nodes involved, or distribute the corresponding 
label bindings if MPLS is used. Aggregated tree construction within the domain can use 
an existing routing protocol such as PIM-SM, use a centralized approach like the one 
proposed in centralized multicast [?], or use the MPLS signaling protocol extensions 
proposed in [?] to support the establishment of pre-calculated trees. 

The set of aggregated trees to be established can be determined based on traffic 
patterns from long-term measurements. Let us say, for example, that measurements in 
MCI-Worldcom’s national backbone show that there are always many concurrent multicast 
sessions that involve three routers in Los Angeles, San Francisco and New York. Based on 
that knowledge, a network operator can instruct the tree manager to set up an aggregated 
tree covering routers in these three locations. Aggregated trees can also be established, 
changed (to add/remove nodes) or removed dynamically based on dynamic traffic 
monitoring. Knowing the set of existing aggregated trees, a tree manager can “match” a 
specific group, with given group membership (set of terminal nodes), to an aggregated 
tree that covers the group (i.e., all terminal nodes are member nodes of the tree). 
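
The covering test used by the tree manager amounts to a subset check, as the sketch 
below shows. The tree identifiers and node sets are hypothetical, and preferring the 
smallest covering tree is our own tie-breaking assumption; the text does not prescribe 
how to choose among several covering trees. 

```python
from typing import Dict, FrozenSet, Optional


def covers(tree_nodes: FrozenSet[str], terminals: FrozenSet[str]) -> bool:
    """A tree covers a group if every terminal node is a member of the tree."""
    return terminals <= tree_nodes


def match_group(
    trees: Dict[str, FrozenSet[str]], terminals: FrozenSet[str]
) -> Optional[str]:
    """Return the id of a covering tree, or None if no existing tree covers the group.

    ASSUMPTION: among covering trees, the one with the fewest nodes is preferred.
    """
    candidates = [
        (len(nodes), tree_id)
        for tree_id, nodes in trees.items()
        if covers(nodes, terminals)
    ]
    return min(candidates)[1] if candidates else None


# Hypothetical set of established aggregated trees inside domain A.
trees = {
    "AT1": frozenset({"A1", "A2", "A3", "Aa", "Ab"}),
    "AT2": frozenset({"A1", "A2", "Aa"}),
}

chosen_g0 = match_group(trees, frozenset({"A1", "A2", "A3"}))
chosen_g1 = match_group(trees, frozenset({"A1", "A2"}))
uncovered = match_group(trees, frozenset({"A1", "A4"}))
```

When `match_group` returns `None`, the manager would have to fall back on one of the 
alternatives the text mentions, such as establishing a new tree. 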

2.3 Discussions 

A number of benefits of aggregation are apparent. First of all, transit nodes don’t need 
to maintain state for individual groups; instead, they only maintain forwarding state 
for a potentially much smaller number of aggregated trees. On a backbone network, core 
nodes are the busiest, and they are often transit nodes for many “passing-by” multicast 
sessions. Relieving these core nodes from per-micro-flow multicast forwarding enables 
better scalability with the number of concurrent multicast sessions. In addition, an 
aggregated tree does not come and go with the individual groups that use it, so tree 
maintenance can be a much less frequent process than in conventional multicast. This 
reduction in control overhead is also very important in helping achieve better 
scalability. 

There are a number of concerns raised by this approach. A prime concern is membership 
dynamics. A problem occurs when a new edge node joins a group but is not covered by the 
current tree, or when an edge node leaves the group and yet still receives multicast 
traffic for this group (i.e., bandwidth is wasted). These problems can be alleviated by 
allowing a group to switch dynamically from one tree to another. To avoid the problems 
caused by membership changes altogether, an ISP should require a customer to provide a 
list of group members (i.e., border routers connecting to customer networks 
participating in the group) prior to the start of a multicast session and not to change 
group membership for the life of the multicast session; this is like providing a 
multi-point “VPN” (virtual private network) service. On the other hand, one may argue 
that membership change on the backbone is very infrequent for many applications. 
For example, an Internet TV station may use an ISP’s national backbone to distribute 
its programming to local regional networks, then to subscribers. There can be frequent 
membership dynamics at access networks connected to subscribers, but membership of 
backbone nodes is likely to be fixed or change very slowly if there is a large 
population of TV viewers. Another example is video-conferencing, in which participants 
are expected to be in the group throughout the session or over a long period of time. 

In group to aggregated tree matching, complications arise when there is no perfect 
match or no existing aggregated tree covers a group. A match is a perfect or non-leaky 
match for a group if all leaf nodes of the tree are terminal nodes for the group, so 
that traffic will not “leak” to any nodes that do not need to receive it. For example, 
the aggregated tree with nodes (A1, A2, A3, Aa, Ab) in Fig. 1 is a perfect match for our 
earlier multicast group G0, which has members (D1, B1, C1). A match may also be a leaky 
match. For example, if the above aggregated tree is also used for group G1, which only 
involves member nodes (D1, B1), then it is a leaky match, since traffic for G1 will be 
delivered to node A3 (and will be discarded there, since A3 does not have state for that 
group). A disadvantage of a leaky match is that some bandwidth is wasted delivering data 
to nodes that are not involved in the group. When no perfect match is found, a leaky 
match may be used if it satisfies certain constraints (e.g., bandwidth overhead is 
within a certain limit). This is often necessary since it is not possible to establish 
aggregated trees for all possible group combinations. The trade-off is bandwidth 
overhead vs. the benefit of aggregation. When no existing aggregated tree covers a 
group, either conventional multicast is used, a new tree is established, or an existing 
tree is extended (by adding new nodes) to cover that group. Of course, it is possible to 
enforce that aggregation is only applied to groups that are covered by a set of 
aggregated trees established based on long-term traffic patterns, and that any other 
group will use conventional multicast. 
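
The distinction between perfect and leaky matches can be made concrete with a small 
sketch. Two things here are assumptions rather than part of the scheme as described: 
the link layout of the tree (A1-Aa, Aa-A2, Aa-Ab, Ab-A3) is a hypothetical reading of 
Fig. 1, and the overhead measure (extra links relative to the links the group actually 
needs) is one plausible way to express the bandwidth constraint, not necessarily the 
metric used in the evaluation. 

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def prune_to_group(edges: Set[Edge], terminals: Set[str]) -> Set[Edge]:
    """Repeatedly drop leaf nodes that are not terminals of the group."""
    remaining = set(edges)
    while True:
        degree: Dict[str, int] = {}
        for u, v in remaining:
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
        leaky = {n for n, d in degree.items() if d == 1 and n not in terminals}
        if not leaky:
            return remaining
        remaining = {(u, v) for u, v in remaining if u not in leaky and v not in leaky}


def link_overhead(edges: Set[Edge], terminals: Set[str]) -> float:
    """Fraction of extra links used, relative to the links the group needs."""
    needed = prune_to_group(edges, terminals)
    if not needed:
        raise ValueError("group needs at least two terminals on the tree")
    return (len(edges) - len(needed)) / len(needed)


def classify_match(edges: Set[Edge], terminals: Set[str], max_overhead: float) -> str:
    """Return 'perfect', 'leaky' (acceptable), or 'rejected' (too wasteful)."""
    overhead = link_overhead(edges, terminals)
    if overhead == 0:
        return "perfect"
    return "leaky" if overhead <= max_overhead else "rejected"


# HYPOTHETICAL link layout of the aggregated tree inside domain A.
tree = {("A1", "Aa"), ("Aa", "A2"), ("Aa", "Ab"), ("Ab", "A3")}

g0 = classify_match(tree, {"A1", "A2", "A3"}, max_overhead=0.5)
g1 = classify_match(tree, {"A1", "A2"}, max_overhead=0.5)
```

Under these assumptions G0 is a perfect match, while G1 uses two links it does not 
need (Aa-Ab and Ab-A3) on top of the two it does, so it is accepted only if the 
operator tolerates that much overhead. 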



3 Provision Multicast with QoS Guarantees 

One motivation for aggregated multicast is to provision multicast services with QoS 
guarantees in future QoS-enabled networks. This problem has not attracted much attention 
yet within the IETF, since the first priority so far has been to provide IP data 
(typically, best effort) multicast services. We note, however, that real time, 
interactive multicast applications will in the future be at least as important as (if 
not more important than) data or, more generally, non real time multicast applications. 
There is an interesting reason why the support of QoS oriented multicast for interactive 
applications will become very important in the future Internet. Today, many non real 
time applications such as news, software distribution, etc., can be effectively 
supported by alternate techniques (to network level multicasting) such as web caching 
and application level multicast. In fact, these alternate techniques are often invoked 
to bypass the problems posed by IP level multicast. Real time (but non interactive) 
applications such as video on demand can take advantage of the same alternate techniques 
(e.g., web caching). In contrast, if we consider truly interactive, real time 
applications such as video conferencing, distributed network games, distributed virtual 
collaborations (with real time visualization and remote experiment steering), and 
distance lectures with student participation, we realize that alternate techniques such 
as web caching would severely affect time responsiveness. Moreover, interactive 
applications cannot be effectively supported by multiple unicast connections, since they 
typically require many-to-many communications (it would be extremely costly to provision 
N x N connections, each with guaranteed bandwidth allocation!). As a result, it is 
important to address the scalability of QoS multicast, since it will be a prominent 
offering in the gamut of future Internet services. 

As we already did for data multicast, the main technique we propose in order to reduce 
router processing overhead (O/H) and enhance the scalability of QoS multicast is 
tree “aggregation”. To this effect, we wish to note that the Internet community has 
already embraced “flow aggregation” as the philosophy for scalable QoS provisioning. In 
fact, today people are backing away from the micro-flow based QoS architecture, namely 
the Integrated Services architecture [?], and are moving towards an aggregated flow 
based architecture, the Differentiated Services architecture [?], at least in the 
network core. The argument backing the aggregated approach is simple: the per-flow 
reservation and data packet handling required by Integrated Services simply do not scale 
to large networks. 

As of now, however, the success of the DiffServ concept observed in QoS unicast 
applications has not yet materialized in the QoS multicast world. In fact, over the 
past few years, several meritorious QoS multicast schemes have been proposed, all still 
inspired by the IntServ model, and all dealing with individual flows. 

The main criticism one can level at such schemes is poor scalability. In earlier 
sections we discussed the O/H required to set up and maintain routes for individual 
best effort multicast groups. The problem becomes much more complex if one must 
allocate and maintain not only routes but also resources (e.g., bandwidth) for 
individual groups. The scalability problem, however, has not so far deterred the 
investigation of IntServ QoS multicast solutions. The main reason was the lack of 
incentive: given that conventional multicast routing requires per flow state, why 
should we seek non per flow state QoS multicast solutions? 

The emergence of aggregated, scalable techniques like the one presented in the 
previous sections of this paper will clearly change the situation. If multicast routing 
has become scalable, then QoS provisioning to multicast applications should also be 
scalable. In fact, the aggregated multicast solutions presented in the previous sections 
will be the starting point for scalable, real time QoS multicast services in the Internet. 



3.1 QoS Provisioning on the Aggregated Tree 

To understand how several multicast groups can be aggregated and managed with QoS 
support, one may go back to the MPLS (Multi Protocol Label Switching) concept men- 
tioned in the previous section. In essence, the aggregated multicast scheme can be viewed 
as the extension of MPLS from the path to the tree. MPLS plays a key role in unicast 
QoS support. It “pins down” the path, allowing the use of arbitrary alternate paths not 
available from the common routing tables. It tunnels several sessions on the same path, 
by encapsulating the IP packets in an MPLS envelope. It enables QoS provisioning and 
grooming on a per MPLS path basis (as opposed to a per flow basis). As a consequence, 
CAC on individual sessions is carried out at the edge node (or Border Gateway) only, 
with minimal latency and without engaging the intermediate nodes along the MPLS 
path. It enables “measurement based” resource tracking and Call Acceptance Control. 
Namely, the edge node need not keep track of the exact number of IP telephony calls (say) 
currently multiplexed on the MPLS path; it simply monitors MPLS utilization (using a 
proper window average) and determines available bandwidth and acceptance/rejection 
policy. Finally, the MPLS mechanism allows very flexible sharing of bandwidth across 
all flows multiplexed on the same path. The MPLS approach is consistent with the 
DiffServ principles of flow aggregation. No per-flow resource allocation or signaling is 
required at the intermediate core routers. 
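
The measurement based admission decision described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the class name, the sliding-window length and the safety margin are hypothetical choices, not values taken from the paper.

```python
from collections import deque


class EdgeAdmissionController:
    """Measurement-based call acceptance for one aggregate (path or tree)."""

    def __init__(self, allocated_mbps, window=8, safety_margin=0.9):
        self.allocated_mbps = allocated_mbps
        self.safety_margin = safety_margin    # hypothetical target utilization
        self.samples = deque(maxlen=window)   # sliding window of measurements

    def record_utilization(self, measured_mbps):
        """Store one periodic measurement of the load carried by the aggregate."""
        self.samples.append(measured_mbps)

    def available_mbps(self):
        """Usable bandwidth left, based on the window-averaged load."""
        avg = sum(self.samples) / len(self.samples) if self.samples else 0.0
        return self.safety_margin * self.allocated_mbps - avg

    def admit(self, requested_mbps):
        """Accept the call only if the measured headroom covers the request."""
        return requested_mbps <= self.available_mbps()
```

Note that the edge node never counts individual calls: it only averages the measured load of the aggregate and compares the remaining headroom with the new request.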

The proposed aggregated multicast approach will extend all the above described 
MPLS features to the Aggregate Tree. In particular, QoS provisioning will be done in 
the background, and may be coordinated between the bandwidth broker of the domain 
in question and the shared tree managers (assuming one resource manager per shared 
tree). 



3.2 QoS Aggregate Tree Implementation and Operation 

In the following we outline a straw-man implementation of the QoS Aggregated Multi- 
cast Tree. This implementation relies on and expands upon the basic AT implementation 
described in the previous section. 

(a) When an AT (Aggregated Tree) is initialized and earmarked for the support of 
a particular QoS application (e.g., video conference), it receives a bandwidth allocation 
commensurate with the traffic predictions for that application among that particular set of 
destinations. The AT is a permanent tree in that it is long-lived and expected to carry a 
large number of sessions simultaneously. 

(b) A measurement based bandwidth management scheme assures that the ratio 
(average traffic load)/(allocated bandwidth) is adequate for the application (accounting 
for both traffic statistical characteristics and application QoS requirements) and for 
the current load. Note that different applications may require different safety margins 
depending on their statistical characteristics and their delay and packet loss constraints. 
If necessary, traffic statistics of an application can be “learned” by observing the flows 
at the edge nodes. The AT bandwidth manager can be centralized or distributed. In 
a distributed implementation, each edge node has a bandwidth agent (BA). The BA 
continuously monitors the above mentioned ratio and acquires/releases bandwidth as 
appropriate. For example, if more bandwidth is needed, the BA uses the existing intra 
domain tools (e.g., Q-OSPF) to determine if more bandwidth is available on the path from 
an edge node to the Core or Rendezvous Point (RP) router (assuming a CBT approach). The 
peripheral BAs exchange information about available bandwidth and come up with a 
consistent bandwidth allocation decision (note that it would not help if one BA allocated 
10 Mbps and another allocated only 5 Mbps!). The BA allocation decisions may be 
supervised by the domain bandwidth broker, to ensure fair resource allocation across 
ATs supporting different applications. 

(c) Call Acceptance Control is decided at the edge node, based on “measured” avail- 
able bandwidth on the AT. This eliminates the very high latency typically experienced 
in conventional “per flow” QoS multicast approaches. 

(d) Users dynamically join/leave an existing multicast AT in a totally transparent 
way and with zero latency - no bandwidth needs to be allocated or released. This is a 
dramatic improvement with respect to the node processing O/H and latency required by 
per flow QoS schemes. 
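
The bandwidth agent behaviour of step (b) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the target ratio, the adjustment step and both function names are assumptions, and the "consistent decision" is modelled simply as the largest allocation that every edge-to-core path can sustain.

```python
def desired_allocation(avg_load, allocated, target_ratio=0.8, step=1.0):
    """Local view of one BA, all values in Mbps.

    Grow the allocation when the load/allocation ratio exceeds the target;
    shrink it when a smaller allocation would still respect the target.
    """
    if avg_load / allocated > target_ratio:
        return allocated + step
    if allocated > step and avg_load / (allocated - step) <= target_ratio:
        return allocated - step
    return allocated


def agreed_allocation(desired, path_reports):
    """Tree-wide decision: never exceed what the most constrained path offers.

    path_reports holds, for each edge node, the largest AT allocation that its
    path to the core can sustain (hypothetical input, e.g. learned via Q-OSPF).
    """
    return min([desired] + list(path_reports))
```

With reports of 12, 5 and 20 Mbps and a desired allocation of 10 Mbps, all agents settle on 5 Mbps, which avoids the inconsistent 10 Mbps versus 5 Mbps outcome mentioned in the text.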

3.3 Statistical Allocation Advantage 

Typically, the use of the AT implies a “wastage” of resources since the streams are 
delivered to more destinations than strictly necessary. This wastage is traded off with 
the reduction in processing O/H and the ease of path and resource maintenance. There 
are situations, however, when the Aggregation approach can in fact lead to bandwidth 
allocation savings. We outline one such example below. 

Consider a video conference with a maximum of 100 simultaneous participants 
placed at different locations. The multicast tree is for simplicity a star with direct, point 
to point links from user to RP router (see Fig. 2). Assume that 1 Mbps “equivalent” 
bandwidth is required by each session. Typically, at any given time only the video 
and audio of the person currently speaking is multicast to the group. In the traditional 
IntServ, “per flow” reservation approach, a full duplex 1 Mbps allocation is required (on 
the link from user to RP router) when a user joins the group. Thus, the “total” bandwidth 
allocation (counting the unidirectional bandwidth in each direction of the link) is 2S 
Mbps, where S is the number of current participants. If we use the AT scheme, the total 
allocated bandwidth is 101 Mbps, regardless of the number of simultaneous participants. 
This is because the AT bandwidth agent (BA) measures bandwidth usage, that is, 1 Mbps 
on each of the 100 links in the direction RP to edge node, and 1 Mbps summed over all 
uplinks from edge node to RP (assuming that only one member is transmitting at a time). 
If we plot the allocated bandwidth as a function of S (see Fig. 3), we note that the IntServ 
scheme requires “more” bandwidth than the AT scheme for S > 50. This is an unexpected 
result, which tells us that because of the more flexible, statistical allocation in the 
Aggregated Tree, we stand to save instead of waste bandwidth! The careful reader will 
notice that the Aggregate Tree saving leads to an allocation that is “asymmetric”; this 
asymmetry can be compensated across different trees, which typically have different roots 
and different topology layouts. Moreover, even the IntServ scheme could take advantage of 
the statistical sharing of uplink bandwidth among various transmitters. But this would 
require checking the bandwidth allocation and state of several different multicast groups 
at each intermediate node! 

Fig. 2. Videoconference multicast group. 
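
The arithmetic of this example can be checked with a few lines of Python. The function names are illustrative only; the figures (100 participants at most, 1 Mbps per session, one speaker at a time) are those of the example above.

```python
def intserv_allocation_mbps(participants, rate_mbps=1.0):
    """Per-flow reservation: full duplex rate on the link of each joined user."""
    return 2 * participants * rate_mbps


def at_allocation_mbps(max_participants=100, rate_mbps=1.0):
    """Aggregated tree: one downlink per tree leaf plus one shared uplink."""
    return max_participants * rate_mbps + rate_mbps


# Smallest number of participants for which IntServ allocates more than the AT.
crossover = next(s for s in range(1, 101)
                 if intserv_allocation_mbps(s) > at_allocation_mbps())
```

The crossover falls at 51 participants (102 Mbps against 101 Mbps), which is the S > 50 condition stated in the text.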



Fig. 3. Total bandwidth allocation as a function of active conference participants. 



3.4 Extensions of the Basic QoS AT Scheme 

Within a domain, the ISP will offer several “permanent” ATs to choose from. The 
multicast group manager will periodically query the AT layout database for information 
regarding all the installed ATs, and will dynamically select the tree that best matches 
its current membership configuration. As the membership grows, the group can simply 
switch from one tree to another. Note that by virtue of the “soft state” operation induced 
by the measurement scheme, this switch-over is totally transparent. It does not require 
any bandwidth reallocation. 

In some cases, the addition of a couple of users in locations not served by the current 
tree may not warrant the switch-over to a new tree. The new users can then be easily 
accommodated by connecting them to the nearest edge node/router. 

In some applications (e.g., battlefield communications, distributed visualization and 
control, etc.) it is important to provide fault tolerant multicast. For example, consider 
the control of a space launch carried out from different ground stations interconnected 
by an Internet multicast tree. This control scenario may require the exchange of real 
time, interactive data and streams. One elegant way to provide fault tolerance is the 
use of separate, possibly node disjoint multicast trees. For added reliability and zero 
switch-over latency, the duplicate data could be sent on both trees simultaneously. 

Another scenario in which the Aggregate Tree concept is beneficial is mobile handoff. 
Consider for example a video conference participant driving between Los Angeles and 
San Diego. Assume that both Los Angeles and San Diego are leaves of the AT tree 
to which the multicast group of the mobile user has subscribed. With our proposed 
scheme, as the user is “handed off” from the Los Angeles to the San Diego edge router, 
he finds a path with resources already allocated to him. There is minimal disruption of 
communications. This “soft handoff” is achieved with no overhead in the core network. 

4 Simulation Studies for State Reduction 

In this section we attempt to quantify the multicast state reduction that can be achieved 
using aggregated multicast. It is worth pointing out that our approach of multicast 
“aggregation” is completely different from the multicast “state aggregation” approaches 
proposed in prior work. We aggregate multiple multicast groups into a single tree to reduce 
the number of multicast forwarding entries, while their approach is to aggregate multiple 
multicast forwarding entries into a single entry to reduce the number of entries. It is 
possible to further reduce multicast state using their approaches in an aggregated multicast 
environment. Here we study the state reduction achieved by “group aggregation” before any 
“state aggregation” is applied. 

First, we introduce two state reduction metrics. Without loss of generality, we assume 
a router needs one state entry per multicast address in its forwarding table. Here we care 
about the total number of state entries that are installed at all routers involved to support 
a multicast group in a network. In conventional multicast, the total number of entries 
for a group equals the number of nodes |T| in its multicast tree T (or subtree within a 
domain, to be more specific), i.e., each tree node needs one entry for this group. In 
aggregated multicast, there are two types of state entries: entries for the shared aggregated 
trees and group-specific entries at terminal nodes. The number of entries installed for 
an aggregated tree T equals the number of tree nodes |T|, and these state entries are 
considered to be shared by all groups using T. The number of group-specific entries 
for a group equals the number of its terminal nodes because only these nodes need 
group-specific state. 

Thus, we come up with the concept of irreducible state and reducible state: group-specific 
state at terminal nodes is irreducible. All terminal nodes need such state information 
to determine how to forward the multicast packets they receive, whether in conventional 
multicast or in aggregated multicast. For example, in our earlier example illustrated by 
Fig. 1, node A1 always needs to maintain state for group G0 so that it knows it should 
forward packets for that group received from D1 to the interface connecting to Aa, and 
forward packets for that group received from Aa to the interface connecting to node D1 
(and not X1 or Y1), assuming a bi-directional inter-domain tree. 

Let $N_a$ be the total number of state entries needed to carry n multicast groups using 
aggregated multicast, and $N_0$ the total number of state entries needed to carry the same n 
multicast groups using conventional multicast. We introduce the overall state reduction 
ratio, i.e., the total state reduction achieved at all routers involved in multicast, 
intuitively defined as 

$$ r_{as} = 1 - \frac{N_a}{N_0} \tag{1} $$

Let $N_i$ be the total number of irreducible state entries that all these groups need (i.e., 
the sum of the number of terminal nodes in all groups). The reducible state reduction ratio 
is defined as 

$$ r_{rs} = 1 - \frac{N_a - N_i}{N_0 - N_i} \tag{2} $$

which reflects the state reduction achieved at transit or core routers. 

Furthermore, we define another metric for aggregation overhead. Assume an 
aggregated tree T is used by groups $G_i$, $1 \le i \le n$, each of which has a “native” tree 
$T_0(G_i)$. The average aggregation overhead for T is defined as: 

$$ \delta_A(T) = \frac{n \times C(T) - \sum_{i=1}^{n} C(T_0(G_i))}{\sum_{i=1}^{n} C(T_0(G_i))} = \frac{n \times C(T)}{\sum_{i=1}^{n} C(T_0(G_i))} - 1 \tag{3} $$

where C(T) is the cost of tree T (total cost of all T’s links). Intuitively, $\delta_A(T)$ 
reflects the amount of extra bandwidth wasted to carry multicast traffic using the shared 
aggregated tree T, as a percentage. Let $N_g$ be the total number of multicast groups and 
$N_t$ the total number of aggregated trees used to support these groups. The average 
aggregation degree, i.e., the average number of groups an aggregated tree “matches”, is 
defined as 

$$ AD = \frac{N_g}{N_t} \tag{4} $$

The larger this number, the larger the number of groups that are aggregated into an 
aggregated tree, and correspondingly the greater the state reduction. This number also 
reflects control overhead reduction: the more groups an aggregated tree supports, the 
fewer trees are needed and thus the less control overhead is required to manage these 
trees (fewer refresh messages, etc.). 
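
As an illustration, the four metrics can be computed for a toy configuration with the Python sketch below. The data classes and field names are hypothetical, and the sketch assumes that the reducible ratio of Eq. (2) is normalized by the reducible part of the conventional state, i.e. the total number of conventional entries minus the irreducible ones.

```python
from dataclasses import dataclass


@dataclass
class Group:
    native_nodes: int    # number of nodes in the native tree of the group
    terminals: int       # terminal nodes, each holding group-specific state
    native_cost: float   # cost of the native tree


@dataclass
class AggregatedTree:
    nodes: int           # number of nodes in the aggregated tree
    cost: float          # cost of the aggregated tree
    groups: list         # groups mapped onto this tree


def state_metrics(trees):
    """Overall ratio, reducible ratio and aggregation degree for a set of trees."""
    groups = [g for t in trees for g in t.groups]
    n_0 = sum(g.native_nodes for g in groups)    # conventional multicast
    n_i = sum(g.terminals for g in groups)       # irreducible entries
    n_a = sum(t.nodes for t in trees) + n_i      # aggregated multicast
    return {
        "r_as": 1 - n_a / n_0,
        "r_rs": 1 - (n_a - n_i) / (n_0 - n_i),
        "AD": len(groups) / len(trees),
    }


def aggregation_overhead(tree):
    """Average aggregation overhead of one tree, as in Eq. (3)."""
    n = len(tree.groups)
    return n * tree.cost / sum(g.native_cost for g in tree.groups) - 1
```

For ten groups, each with a 5-node native tree of cost 4 and 3 terminals, sharing one 6-node aggregated tree of cost 5, the overall ratio is 0.28 while the reducible ratio is 0.7, with an aggregation degree of 10 and an overhead of 0.25. The gap between the two ratios is the effect of the irreducible state at terminal nodes.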

Next we will present simulation results from a dynamic matching experiment allowing 
leaky matches. In this experiment, we use the Abilene network core topology as 
our simulation network, which has eleven nodes located in eleven metropolitan areas. 
Distance between two locations is used as the routing metric (cost), which could result 
in different routes than the real ones; however, routes from UCLA to a number of 
universities (known to be connected to Internet2) discovered by traceroute are consistent 
with what we expect from the Abilene core topology using distance as routing metric. 

We randomly generate multicast groups and use the following strategy to match them 
with aggregated trees and establish more aggregated trees when necessary. In generating 
groups, every node can be a terminal node (i.e., we don’t single out any node to be a core 
node that is not directly accessible to neighboring networks); in the simulation results to 
be presented, group size is uniformly distributed from 2 to 10. When a group G is generated, 
first a source-based “native” multicast tree $T_0$ (with a member randomly picked as the 
source) is computed. An aggregated tree T (from a set of existing ones, initially empty) 
is selected for G if the following two conditions are met: (1) T covers G; and (2) after 
adding G, $\delta_A(T) < b_{th}$, where $b_{th}$ is a fixed threshold to control $\delta_A(T)$. 
When multiple trees satisfy these conditions, a min-cost one is chosen. If no existing tree 
satisfies these conditions, either (1) an existing tree T is extended (by adding necessary 
nodes) to cover G if the extended tree T' can satisfy the following condition: after adding 
G, $\delta_A(T') < b_{th}$; or (2) the native tree for G is added as a new aggregated tree. 
The constraints above guarantee that bandwidth overhead is under a certain threshold. 
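
The selection step of this strategy can be sketched in Python as follows. This is a simplified illustration rather than the simulator used for the experiments: trees are reduced to a node set and a cost, the native tree of each group is assumed to be computed elsewhere and passed in, and the tree extension option is omitted, so a group that matches no existing tree always gets its native tree as a new aggregated tree. All names are hypothetical.

```python
class AggTree:
    """Aggregated tree reduced to its node set, its cost and its users."""

    def __init__(self, nodes, cost):
        self.nodes = frozenset(nodes)
        self.cost = cost
        self.native_costs = []   # native tree costs of the groups using the tree

    def overhead_with(self, native_cost):
        """Aggregation overhead of the tree if one more group were added."""
        costs = self.native_costs + [native_cost]
        return len(costs) * self.cost / sum(costs) - 1


def match_group(members, native_nodes, native_cost, trees, b_th):
    """Return the tree chosen for the group, creating one if necessary."""
    candidates = [t for t in trees
                  if set(members) <= t.nodes                  # (1) T covers G
                  and t.overhead_with(native_cost) < b_th]    # (2) overhead bound
    if candidates:
        tree = min(candidates, key=lambda t: t.cost)          # min-cost match
    else:
        tree = AggTree(native_nodes, native_cost)             # native tree as new AT
        trees.append(tree)
    tree.native_costs.append(native_cost)
    return tree
```

A group mapped onto its own native tree contributes no overhead, so the threshold only becomes binding once smaller groups start sharing a larger tree.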




Fig. 4. Average aggregation degree vs. number of groups. 



Fig. 4 plots the simulation result of average aggregation degree vs. number of groups 
added for different bandwidth overhead thresholds. As the result shows, as more groups 
are added (i.e., more concurrently active groups), the average aggregation degree 
increases: we can “squeeze” more groups into an aggregated tree, on average. The bandwidth 
overhead threshold affects aggregation degree in a “positive” way: as we lift the control 
threshold, more aggregation can be achieved, since the more bandwidth we are willing to 
“sacrifice” for aggregation, the more aggregation we get. Fig. 5 and Fig. 6 plot the 
results for the overall state reduction ratio and the reducible state reduction ratio 
defined in Eqs. (1) and (2), and demonstrate the same trend regarding the number of groups 
and bandwidth overhead threshold as the aggregation degree. The results show that, though 
overall state reduction has its limit, reducible state is significantly reduced (e.g., over 
80% for a 20% bandwidth overhead threshold). This also confirms our earlier analysis. 

In interpreting the implications of the above simulation results, we should be aware 
of their limitations: the network topology is fairly small, and it is adopted from a logical 
topology rather than a real backbone network with all routers present. Nevertheless, 
it should give us some feeling for the “trend”. Another fine point is that this simulation 
represents a worst-case scenario, since all groups are randomly generated and have 
no correlation or pattern. In practice, certain multicast group membership patterns 
(locality, etc.) may be discovered from measurements and can help to realize more efficient 
aggregation. 




Fig. 5. Overall state reduction ratio vs. number of groups. 

Fig. 6. Reducible state reduction ratio vs. number of groups. 

5 Conclusions 

In this paper, we proposed a novel approach, aggregated multicast, to provision QoS 
multicast within a domain. The key idea of aggregated multicast is to force groups 
into sharing a single delivery tree. This way per-flow state is eliminated from the network 
core and is only required at edge routers. 

Our work could be summarized in the following points: 

- Aggregated multicast is an unconventional yet feasible and promising approach. 

- We discussed how this approach can be used to provision multicast with QoS guar- 
antees. 

- We proposed metrics to quantify multicast state reduction in aggregated multicast 
and our initial simulation shows promising results. 

Abstract. Recently, a TCP-friendly, single-rate multicast congestion 
control scheme called pgmcc was introduced by one of the authors. In 
this paper, we study the fairness of pgmcc in a variety of scenarios in 
which a multicast transfer session competes with long-lived TCP flows 
and web-like traffic. We evaluate fairness of the pgmcc scheme at differ- 
ent timescales and compare it with the fairness of the TCP congestion 
control algorithm. 

Our results show that pgmcc is capable of sharing fairly the available 
bandwidth with competing connections. In particular, the use of a closed 
control loop between the sender and a group’s representative - which 
closely mimics the TCP congestion control - guarantees that pgmcc is 
fair to TCP sessions and that it is capable of reacting quickly to changes 
of network conditions without compromising fairness. 



1 Introduction 

It is generally accepted that in order to be successfully deployed in today’s 
Internet, IP Multicast needs a set of multicast congestion control mechanisms 
that are easily deployable, accurately tested and can co-exist with TCP. The 
IETF has defined a set of procedures and criteria for evaluating reliable 
multicast protocols: the success of a proposed multicast protocol relies on its 
ability to compete with TCP traffic without threatening network stability. 

In earlier work, a multicast congestion control scheme called pgmcc was proposed 
as a viable congestion control scheme for single-rate multicast sessions. That 
work explains the basic pgmcc design choices and shows how the scheme achieves 
scalability, stability and fast response to changes in network conditions in a wide 
variety of experimental scenarios. 

In this paper, we complement that work by studying in depth the fairness 
of pgmcc. Our goal is to better comprehend to what extent pgmcc is capable of 
fairly sharing the available bandwidth with competing TCP traffic. 

* Work partly supported by Cisco Systems, Microsoft Research, IAT-CNR and 
Università di Pisa. 

S. Palazzo (Ed.): IWDC 2001, LNCS 2170, pp. 309-323, 2001. 

Although multicast fairness has been studied for many years, 
there is still no general consensus on what should be the relative fairness 
between multicast and unicast traffic. Different bandwidth allocation policies are 
possible [13]. For example, a multicast session could deserve more bandwidth than 
a TCP connection because it is intended to serve more receivers. On the other 
hand, it is also reasonable that a multicast session should not be given more 
bandwidth than TCP connections, in order not to penalize TCP connections 
that share portion of the path with a multicast session with a large number of 
receivers. 

In this context, the pgmcc scheme has been designed to operate in a multicast 
environment where neither unicast connections nor multicast sessions are to be 
given any kind of preferential treatment. We consider that bandwidth is allocated 
following the widely popular max-min fairness model: a multicast session will be 
allocated bandwidth according to the most congested path in its tree. Thus, 
if the bottleneck link on the most constrained path of the tree has a capacity 
C, the pgmcc session will be allocated a share C/n, where n is the number of 
competing sessions on that link. 

We will evaluate pgmcc fairness by comparing pgmcc and TCP average sending 
rates. Thus, for the scope of this paper, pgmcc is considered to be fair if its 
sending rate is comparable to the sending rate of an “equivalent” TCP connection 
(i.e. a TCP connection experiencing the same network conditions). 

Note that we do not address the problem of inter-receiver fairness, that is, 
fairness among receivers of the same multicast group. Indeed, pgmcc adapts its 
transmission rate to the bandwidth available to the “worst” receiver of the group. 
Thus, by definition, pgmcc does not assure any kind of inter-receiver fairness. 

The rest of the paper is organized as follows. Section 2 provides a brief 
overview of the proposed congestion control scheme, with a description of the 
mechanisms that have an active role in guaranteeing the fairness of the scheme, 
namely: i) the election of a group’s representative (acker), ii) the loss rate 
estimation performed by receivers, and iii) the window-based flow control run 
between the sender and the acker that mimics TCP congestion control. In 
Section 3 we describe the fairness metric we use, while Section 4 describes a basic 
set of network scenarios that need to be analyzed in detail. Then, in Section 5 
we present the results of our simulations, and Section 6 concludes the paper. 



2 Overview of pgmcc 

The pgmcc scheme is based on two separate but complementary mechanisms: i) a 
window-based control loop which closely emulates TCP congestion control, and 
ii) a procedure to select a group’s representative (acker). 

The window based control loop is simply an adaptation of the TCP congestion 
control scheme to a protocol where lost data packets are not necessarily 
retransmitted, and so the congestion control scheme cannot rely on cumulative 
acknowledgements. In pgmcc, the “window” is simulated using a token-based 
scheme which decouples congestion control from retransmission state. 
One of the receivers in the group is elected by the sender as the acker, i.e. the 
node in charge of sending positive acknowledgements back to the source and 
thus controlling the transfer. 

The procedure to elect the group’s representative makes sure that, in the presence 
of multiple receivers, the acker is dynamically selected to be the receiver which 
would have the lowest throughput if a separate TCP session were run between 
the sender and each receiver. For the acker selection mechanism, pgmcc uses a 
throughput equation to determine the expected throughput for a given receiver 
as a function of the loss rate and round-trip time. Unlike other schemes [7], 
the TCP throughput equation is not used to determine the actual sending rate, 
which is completely controlled by the window-based control loop. 

In principle, pgmcc’s congestion control mechanism works as follows: 

1. Receivers measure the loss rate and feed this information back to the sender, 
either in positive acknowledgements (ACK) or negative acknowledgements 
(NAK). 

2. The sender also uses these feedback messages to measure the round-trip time 
(RTT) to the source of each feedback message. 

3. The loss rate and RTT are then fed into pgmcc’s throughput equation, to 
determine the expected throughput from the sender to that receiver. 

4. The sender then selects as the acker the receiver with the lowest expected 
throughput, as computed by the equation. 

The dynamics of the acker selection mechanism are sensitive to how the measurements 
are performed and applied. In the rest of this section we describe specific 
mechanisms to perform and apply these measurements (other mechanisms are 
possible). 

pgmcc operates end to end, and requires small constant state and a minimal 
amount of computation at both sender and receivers. We want to emphasize 
that, while our scheme involves positive and negative acknowledgements, we do 
not make any assumption on the reliability of the data transfer. This makes our 
scheme applicable equally well to unreliable data transfers. 

2.1 Round-Trip Time Measurement 

The classical way to measure the RTT without synchronized clocks is to include 
a timestamp in each packet from the source, and let receivers echo back the most 
recently received timestamp, possibly corrected with the difference between the 
time of reception and the time the feedback is actually sent (such delays are part 
of the feedback suppression schemes). The resolution of this method (especially 
for the correction factor) depends on the resolution of the clock at each receiver. 
If the latter is too coarse, the correction factor might introduce a large variance 
on the RTT estimates, biasing the results of the measurements in favour or 
against some receivers. Because we expect to deal with a large population of 
heterogeneous receivers, we cannot depend on the availability of a high resolution 
clock at all receivers. As a consequence, and without too much loss of precision, in pgmcc 
we chose to measure the RTT in terms of packets: the sender simply computes the 
difference between the most recent sequence number sent and the most recent 
sequence number seen by the receiver (echoed in each NAK and ACK packet). 
This way we do not need to send timestamps or rely on the timer resolution 
at the receiver; on the other hand, for a path with a given RTT (measured in 
seconds), the value in packets computed by pgmcc will vary depending on the 
actual data rate. However this variation applies in the same way to all receivers, 
so it is not a source of discrimination among receivers. Furthermore, the RTT 
measurement in pgmcc is only used for comparing receivers, not for the actual 
selection of transmit rate, so any discrepancy between the real and the measured 
RTT cannot influence the inter-protocol fairness. 



2.2 Loss Rate Measurement 

The loss measurement in pgmcc is entirely performed by receivers. Again, the 
measurement results do not directly influence the transmit rate, but are only 
used for comparison purposes. As a consequence, pgmcc is reasonably robust to 
different measurement techniques, as long as they are not influenced too strongly 
by single loss events. 

The method used for loss measurement is the exponentially weighted moving 
average (EWMA), which is formally equivalent to a single-pole digital low pass 
filter applied to a binary signal $s_i$, where $s_i = 1$ if packet i is lost and $s_i = 0$ if 
packet i is successfully received. The loss rate $p_i$ upon reception or detection of 
loss of packet i is computed as 

$$ p_i = c_p \, p_{i-1} + (1 - c_p) \, s_i $$

where the constant $c_p$, between 0 and 1, is related to the bandpass of the filter. 
Experiments have shown good performance with $c_p = 500/65536$, and computations 
performed with fixed point arithmetic and 16 fractional digits. 
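
The filter can be sketched in integer arithmetic as shown below. The sketch rests on an assumption that should be read carefully: it treats the small constant 500/65536 as the weight of the newest sample, so that the previous estimate keeps the weight 1 - 500/65536. This is the reading under which a single loss event moves the estimate only slightly; if the constant is instead meant as the weight of the previous estimate, the two weights swap. The function and constant names are illustrative.

```python
FRACTION_BITS = 16
ONE = 1 << FRACTION_BITS     # the value 1.0 in fixed point
NEW_SAMPLE_WEIGHT = 500      # 500/65536, assumed weight of the newest sample


def update_loss_rate(p_fixed, lost):
    """Return the filtered loss rate after one packet, in 16-bit fixed point."""
    sample = ONE if lost else 0
    # p <- p + w * (s - p), with w = NEW_SAMPLE_WEIGHT / 65536
    return p_fixed + (((sample - p_fixed) * NEW_SAMPLE_WEIGHT) >> FRACTION_BITS)


def as_fraction(p_fixed):
    """Convert the fixed point estimate to a float between 0 and 1."""
    return p_fixed / ONE
```

Under this reading the filter has a memory of roughly 65536/500, i.e. about 131 packets, so an isolated loss raises the estimate by less than one percentage point.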



2.3 Acknowledgements 

For each data packet (but not for retransmissions), one of the receivers is in 
charge of sending positive acknowledgements (ACKs). The identity of the acker 
is carried in each data packet. 

ACKs contain loss reports (same as in NAKs), and a couple of additional 
fields, namely the sequence number of the data packet which elicited this ACK, 
and a bitmap indicating the receive status of the most recent 32 packets. This 
allows the sender to recover lost ACKs, and deal properly with out-of-order ACK 
delivery (which can occur when the acker switches between nodes on different 
paths). 

2.4 Window-Based Controller 

pgmcc uses a window-based congestion control scheme which is run between the 
sender and the acker, and mimics TCP congestion control. To implement this, 
the sender manages two state variables: a window¹ W, and a token count T, 
both initialized to 1 when the session starts or restarts after a stall (i.e. ACKs 
stop coming in and a timeout expires). The role of W is to determine how fast 
the window opens, same as in TCP. Tokens are instead used to regulate the 
generation of data packets: one token is necessary (and consumed) to transmit 
one packet, and tokens are regenerated by incoming ACKs. 

In detail, W and T are updated as follows: 

- on session restart, W = 1, T = 1; 

- on transmit, T = T - 1 (consume one token); 

- on ACK, W = W + 1/W, T = T + 1 + 1/W; 

- on loss detection, W = W/2, ignore next W/2 ACKs. 

The behaviour on normal ACKs mimics TCP’s linear increase: the window 
expands by one packet for each round trip time. Similar to TCP, we assume a 
packet loss when a given packet has not been ACKed in a number of subsequent 
ACKs (this dupack threshold is set to 3 in our tests), and reproduce TCP’s 
multiplicative decrease by halving the window. In order to match the number 
of outstanding packets to the window count, we need to avoid incrementing the 
token count for W/2 ACKs. Also, we do not react to further congestion events for 
the next RTT (this is easily achieved by recording the sequence number of the 
most recently transmitted packet). 
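
The update rules above can be put together in a small Python sketch. It is an illustration under explicit assumptions, not the pgmcc implementation: on an ACK both updates use the window value held before the ACK, the number of ACKs to ignore is taken from the window before halving, ignored ACKs change neither W nor T, and the window is never allowed to drop below one packet.

```python
class WindowController:
    """Window and token bookkeeping at the sender."""

    def __init__(self):
        self.restart()

    def restart(self):
        """Session start, or restart after a stall."""
        self.W = 1.0             # congestion window, in packets
        self.T = 1.0             # tokens available for transmission
        self.ignore_acks = 0     # ACKs that must not regenerate tokens
        self.last_sent = -1      # highest sequence number transmitted
        self.recover_seq = -1    # losses up to this number are one event

    def can_send(self):
        return self.T >= 1.0

    def on_transmit(self, seq):
        self.T -= 1.0            # one token is consumed per packet
        self.last_sent = seq

    def on_ack(self):
        if self.ignore_acks > 0:
            self.ignore_acks -= 1
            return
        self.T += 1.0 + 1.0 / self.W
        self.W += 1.0 / self.W

    def on_loss(self, lost_seq):
        if lost_seq <= self.recover_seq:
            return               # already reacted within this round trip
        self.ignore_acks = int(self.W / 2)   # assumption: window before halving
        self.W = max(1.0, self.W / 2)        # assumption: floor of one packet
        self.recover_seq = self.last_sent
```

Recording the last transmitted sequence number at the time of the loss is what makes further losses within the same round trip count as a single congestion event.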



2.5 Acker Election and Tracking 



The acker election process in pgmcc aims at locating the receiver which would 
have the lowest throughput if each receiver were using a separate TCP connection 
to transfer data. Because the steady-state throughput of a TCP connection can 
be characterised in a reasonably accurate way in terms of its loss rate and round 
trip time [11], the throughput for each receiver can be estimated by using these 
two parameters. 

Whenever an ACK or NAK packet from any of the receivers reaches the 
sender, the latter is able to compute the expected throughput $T_i$ for that receiver 
by using the well-known TCP throughput formula: 

$$ T_i \propto \frac{1}{R_i \sqrt{p_i}} \tag{1} $$

where $R_i$ and $p_i$ are the round trip time and the loss rate measured for receiver 
i, respectively. At any given time, the sender stores the expected throughput 
for the current acker, $T_{acker}$. This value is updated every time an ACK or NAK 
from the current acker is received. 

¹ Note that the “window” used for congestion control purposes does not correspond 
to the “window” used for reliability or flow control. 

The selection process does not require knowledge of the whole population of 
receivers, or the evaluation of $T_i$ for all of them. When we receive a NAK from 
node j, we can decide whether to switch to a new acker from the current one 
(node i) by just comparing $T_i$ and $T_j$. 

We should remark that the acker selection process is unavoidably approxi- 
mate. Often, we only have a few RTT and loss rate samples from each potential 
acker, and those samples might be affected by large uncertainties. Furthermore, 
the formula used to elect the acker is approximate and derived under assump- 
tions which might not be valid during the switch. 

As a consequence, it is essential that we apply some hysteresis when deciding 
to switch to a new acker, and in all cases, we should not interpret a change of 
acker as a congestion signal. Rather, we assimilate the selection of a new 
acker to a move of the node in charge of sending ACKs to a path with different 
features. This is possible because for each data packet there is only one acker, 
and we have procedures to deal with duplicate, out of order and missing ACKs. 
Should the new acker experience congestion, we will get a timely notification by 
making use of the new ACKs. 
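
The switch decision discussed in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pgmcc implementation: the throughput model is the simplified form in which throughput is proportional to 1/(R sqrt(p)), and the function names and the hysteresis factor of 0.75 are hypothetical choices made for the example.

```python
import math


def expected_throughput(rtt, loss_rate):
    """Simplified TCP model: throughput proportional to 1 / (RTT * sqrt(p)).

    The proportionality constant is omitted because only comparisons
    between receivers matter for the election.
    """
    if rtt <= 0 or loss_rate <= 0:
        return math.inf  # no loss reported yet: never the worst receiver
    return 1.0 / (rtt * math.sqrt(loss_rate))


def should_switch_acker(t_current, t_candidate, hysteresis=0.75):
    """Switch only if the candidate is clearly slower than the current acker.

    hysteresis is a hypothetical factor in (0, 1]; values below 1 favour
    the current acker and damp oscillations between similar receivers.
    """
    return t_candidate < hysteresis * t_current


# A NAK arrives from node j while node i is the current acker.
t_i = expected_throughput(rtt=0.1, loss_rate=0.01)
t_j = expected_throughput(rtt=0.4, loss_rate=0.01)
switch = should_switch_acker(t_i, t_j)
```

Only the two throughput estimates are compared, so the sender never needs the full receiver population, which matches the description above.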



3 Fairness Metric 



Throughout this paper, we will compare the average send rates of TCP and 
pgmcc flows experiencing similar network conditions or competing for bandwidth 
on the same bottleneck link. Therefore, we declare pgmcc to be “fair” if its send- 
ing rate is comparable to the sending rate of a TCP connection that experiences 
the same network conditions. 

The timescale at which the sending rates are measured naturally affects the
values of these measurements. The use of too coarse a timescale will hide some
behaviours (e.g. sending rate burstiness) while too fine a timescale will make
measures very dependent on transient phenomena (e.g. retransmit timeout). For
this reason we will study the fairness of pgmcc on a wide range of timescales, from
approximately twice the round trip time of the connections under analysis to
the entire duration of the experiments.

We define the send rate of session i, using packets b bytes long, at the
timescale \tau as:

    R_i(t) = \frac{b \cdot \langle \text{packets sent in the interval } (t, t + \tau) \rangle}{\tau}        (2)

Then, we compute the fairness index F_\tau at timescale \tau as follows [3]:

    F_\tau(t) = \frac{\left( \sum_{i=1}^{n} x_i(t) \right)^2}{n \sum_{i=1}^{n} x_i(t)^2}        (3)

where x_i(t) = R_i(t) / B_i is the ratio between the sending rate of session i and its
allocated bandwidth B_i, and n is the number of sessions competing for bandwidth.
The fairness index is bounded between 0 and 1, where a value of 1 indicates that
network resources are fairly shared according to the targeted allocation.
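
As an illustration, the send rate and the fairness index can be computed as in the following Python sketch. It assumes the index has the usual form (sum of x_i)^2 / (n * sum of x_i^2); the function names and the sample rates are hypothetical and are not taken from the simulations.

```python
def send_rate(packets_sent, packet_bytes, tau):
    """Send rate in bytes per second over an interval of tau seconds."""
    return packet_bytes * packets_sent / tau


def fairness_index(rates, allocations):
    """Fairness index of the normalised rates x_i = R_i / B_i."""
    x = [r / b for r, b in zip(rates, allocations)]
    sum_sq = sum(v * v for v in x)
    if sum_sq == 0:
        raise ValueError("fairness index is undefined when all rates are zero")
    return sum(x) ** 2 / (len(x) * sum_sq)


# Four sessions with an equal target allocation of 250 (arbitrary units).
equal_split = fairness_index([250, 250, 250, 250], [250] * 4)
one_hog = fairness_index([1000, 0, 0, 0], [250] * 4)
```

With an equal split the index is 1, while a single session taking all the bandwidth among four gives 1/4, the lower bound 1/n for n sessions.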

4 Fairness of the Scheme 

It has been shown in [?] that under normal operating conditions pgmcc and
TCP share bandwidth fairly. On the other hand, [?] also mentions that there
are some scenarios which need more investigation as far as fairness is concerned.
These scenarios are discussed below, and the behaviour of pgmcc in those of
them requiring further investigation is studied by simulations in Section 5.

Receivers behind the same bottleneck. Consider the scenario illustrated
in Figure 1, where two receivers behind the same bottleneck experience very
different round trip times. In this case, given that all packet losses occur on the
bottleneck link L1, both receivers will send reports with the same loss rate value.
Thus, in the presence of NAK suppression, there is no guarantee that the sender will
elect the receiver with the largest RTT as the acker. In fact, when the difference
between RTTs is very large, the sender will probably never receive a NAK from
the receiver with the larger RTT.



[Figure 1: topology diagram. Labels: pgmcc sender; PGM Network Element; two pgmcc receivers; link L1 (1 Mbps, 50 ms); a second link (10 Mbps, 200 ms).]

Fig. 1. In absence of other traffic, the two receivers will experience the same packet
losses but very different round trip times.



This behaviour cannot be considered a source of unfairness. In fact, TCP itself
favors flows with a smaller round trip time, and in the presence of multiple receivers
behind the same bottleneck there is no reason for the multicast session to adapt
to the slower rather than the faster of its members.

Note that, if a TCP connection competes on link L2 with the slowest receiver,
that receiver will experience more losses than the rest of the group and thus will
become the acker, ensuring a fair allocation of bandwidth along the multicast
tree. Thus, this scenario does not lead to unfair behaviour of the scheme.

Receivers with a very small round trip time. The acker election process
is sensitive to the timely distribution of loss reports from receivers. As soon
as a receiver identifies a hole in the packets’ sequence numbers, it schedules a
delayed NAK transmission with an updated loss report. Reports from receivers
that experience a smaller round trip time will likely reach the sender sooner than
others (which will instead likely be suppressed along the shared path).

Note: in the scenario with receivers behind the same bottleneck, a “very large”
difference between RTTs means one at least comparable with the maximum NAK
backoff time [?], the random time PGM receivers wait before sending a
retransmission request.

It is possible then for a “fast” receiver experiencing a transient period of 
congestion to be elected as the new acker, and to make the sender increase the 
sending rate (the window is increased approximately by one per round trip time 
of the current acker). In this case, even if “slow” receivers experience higher loss 
rates, the delay involved in the acker election (about 2 RTTs) will make the
pgmcc session behave unfairly toward competing TCP flows along the slow
path. We will discuss this scenario in greater detail in Section 5.1.

Ambiguity in the acker election process. One of the possible caveats of
using any TCP formula such as (1) for the acker election process is the difficulty for the
sender to discern between two receivers along different paths with similar
estimated throughput. Variability in receivers’ reports could mislead the sender, and
make it alternate between two or more receivers, possibly resulting
overall in an unfair sharing of available bandwidth with competing TCP flows.

In fact, if acker switches are too frequent, receivers might not have a chance
to send at least three acknowledgements in a row, which is the minimum amount
of feedback required to signal congestion to the sender. To avoid this situation,
[?] introduces some hysteresis in the election process to favour the current acker.
This reduces the probability of an acker switch when multiple receivers
have similar throughput expectations (according to the formula used). In
Section 5.1 we will show a set of simulation results that show how pgmcc is capable
of correctly handling such particular scenarios.

Receivers with a very high loss rate. It is known that the simplified TCP
throughput formula (1) overestimates the throughput for high loss rates
(roughly above 5%). In the acker election process, this error can make the sender
elect the wrong receiver as the acker, thus resulting in unfair behaviour towards
other flows competing on the path with the actual worst receiver.

A simple fix to this problem would be the use of a more precise formula for
estimating the throughput [?]. However, in case of very high loss rates, TCP
congestion control and the pgmcc scheme are both dominated by timeouts, making
it very difficult to implement any reasonably smooth control on the sending rate.

In Section 5.2 we run a network simulation where we vary the loss rate on
one path of the multicast tree to evaluate its impact on the protocol fairness.

Receivers with uncorrelated packet losses. The presence of multiple end-
to-end paths in a multicast tree can make the sender assume an overall loss
rate much higher than the one of each individual receiver. This could lead to an
average session bandwidth far below the fair share allocated to the session [?].
To solve this problem, in pgmcc each receiver computes the loss rate and feeds
it back to the sender in each NAK packet. This way, the sender can estimate
the loss rate observed by each receiver. Moreover, the sender uses the loss rate
only to elect the current acker, while it regulates the actual sending rate based
on acknowledgements from the acker.

In Section 5.3 we present a case study where we verify the behaviour of
pgmcc, computing the average bandwidth of the session in the presence of
uncorrelated packet losses and a very large group of receivers.

Denial of service. Another issue common to all single-rate multicast schemes
is the authentication of receivers’ reports. Indeed, it is possible for a malicious (or
malfunctioning) receiver to send fake reports that can drive the session
transmit rate down to zero. However, this issue is out of the scope of this paper,
because we believe that a solution to this problem cannot be found in the design
of a congestion control scheme; it requires separate mechanisms for the
authentication of receivers by the sender.

5 Experimental Results 

In this Section we investigate the behaviour of pgmcc in some of the scenarios
presented in the previous Section. To this purpose, we have used an
implementation of pgmcc under the ns simulator [?]. In the scenarios described in this
paper, routers are PGM-compliant Network Elements [?] and TCP sources
implement the NewReno modification. Both PGM and TCP data packets are 1000
bytes long.



[Figure 2: topology diagram. Labels: pgmcc sender; 15 pgmcc receivers; TCP receiver; 5 Mbps, 2% packet drop prob., 60 packets FIFO queue.]

Fig. 2. A topology to exercise the acker election process in presence of different paths.
Link L2 has a variable RTT. On link L1, three TCP flows compete for bandwidth with
a pgmcc session with receivers on nodes 2 and 3.



5.1 Acker Election 

An interesting network scenario to test the acker election procedures is shown
in Figure 2. Here, a set of receivers lies behind a congested link, L1, which has
a high RTT (mainly due to queueing delay), while a second set of receivers
makes use of a high bandwidth link (L2) with a variable end-to-end delay and a
non-negligible packet loss (it may model a link with a high degree of statistical
multiplexing).

Note: the source code used for the simulations described in this paper can be found at the
following URL: http://www.iet.unipi.it/~luigi/pgmcc/

In this experiment, link L1 has a capacity of 1 Mbps, a propagation delay
of 50 ms and a FIFO queue with 60 slots; link L2 has a capacity of 5 Mbps,
a FIFO queue with 60 slots and a fixed 2% packet loss probability. On link L1
the pgmcc session competes for bandwidth with 3 long-lived TCP sessions. The
multicast group counts 15 receivers per node. Of course the number of receivers
per node does not influence the fairness of the protocol, given that all receivers
behind the same node experience the same round trip time and packet losses.

To verify the behaviour of pgmcc in different network scenarios, we vary the
propagation delay of link L2 from 10 ms to 400 ms. This way we can measure
the fairness of the proposed scheme in the presence of receivers with very different
round trip times. On link L1, the round trip time is dominated by the queueing
delay and can be as large as 580 ms. On link L2, no queues build up, so the
round trip time is approximately twice the propagation delay.
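
The 580 ms figure can be checked with a short calculation. The Python sketch below assumes that the full 60-slot queue holds 1000-byte packets, that queueing occurs only in the data direction, and that transmission times other than the queue drain time are negligible; the function names are hypothetical.

```python
def max_queueing_delay(queue_slots, packet_bytes, capacity_bps):
    """Time in seconds to drain a full FIFO queue at the link capacity."""
    return queue_slots * packet_bytes * 8 / capacity_bps


def max_rtt(prop_delay_s, queue_slots, packet_bytes, capacity_bps):
    """Two-way propagation delay plus worst-case queueing delay."""
    return 2 * prop_delay_s + max_queueing_delay(
        queue_slots, packet_bytes, capacity_bps
    )


# Link L1: 1 Mbps, 50 ms propagation delay, 60-slot queue, 1000-byte packets.
rtt_l1 = max_rtt(0.050, 60, 1000, 1_000_000)
```

The queue alone contributes 480 ms, which added to the 100 ms of two-way propagation delay gives the 580 ms quoted above.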

In Figure 3 the curve labeled “link L2” shows the average throughput of the
session for propagation delays on link L2 varying from 10 ms to 400 ms, when
no receivers on node 3 join the group. As expected, the throughput is inversely
proportional to the round trip time.

When receivers on node 3 join the pgmcc session, the latter will compete for
bandwidth on link L1 with 3 TCP sessions (“tcp1”..“tcp3”). The curve “pgmcc”
shows the throughput for a session with receivers on nodes 2 and 3.

For short propagation delays on link L2, the pgmcc sender elects an acker
behind link L1. When the delay on L2 increases (to more than 130 ms), the
throughput on L2 falls below the 250 Kbit/s corresponding to the fair rate on
L1 for the 4 sessions, so the acker moves to a node behind link L2.
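
As a rough sanity check on the crossover delay, the following Python sketch estimates the propagation delay at which the modelled throughput on L2 equals the 250 Kbit/s fair rate. It assumes the Mathis et al. form of the simplified formula, (packet size / RTT) * sqrt(3/2) / sqrt(p); the constant sqrt(3/2) is an assumption, since the constant used in the simulations is not stated here, and the function name is hypothetical.

```python
import math


def crossover_delay_s(packet_bytes, loss_rate, target_bps):
    """One-way propagation delay at which the modelled rate equals target_bps.

    Assumes no queueing on the link, so RTT is twice the propagation delay.
    """
    rtt = packet_bytes * 8 * math.sqrt(1.5) / (target_bps * math.sqrt(loss_rate))
    return rtt / 2


fair_rate_bps = 1_000_000 / 4  # 1 Mbps on L1 shared by 4 sessions
delay = crossover_delay_s(1000, 0.02, fair_rate_bps)
```

Under these assumptions the crossover falls at roughly 139 ms, which is in line with the acker moving behind L2 once the delay exceeds about 130 ms.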

A few observations can be made on these results: 

– as long as the control equation gives a clear indication of the worst path,
the pgmcc sender behaves fairly with TCP flows (first part of the graph) or
achieves the expected average throughput (second half of the graph).

– in the presence of a very fast receiver (e.g. when the delay on link L2 is 10 ms,
the leftmost point in the graph) pgmcc shows a slightly unfair behaviour.
This is due to the fact that any acker switch toward a receiver behind link
L2 makes the sender open the window very quickly.

In order to study in detail the fairness of pgmcc we compute the fairness index
for different timescales from 1 second to 100 seconds. Moreover, to compare
the fairness of pgmcc to TCP fairness, we have run a simulation where 4 TCP
sessions are competing for bandwidth on link L1.

In Figure 4 we plot the average value of the fairness index as a function of the
timescale for three particular values of the propagation delay on link L2: 10 ms,
100 ms and 130 ms. The curves are computed by averaging over 10 simulation
runs (90% confidence intervals are shown). As we can see from the graphs, pgmcc
behaves fairly at different timescales and in all the network conditions we have
considered: the average fairness index is never less than 0.75 and there is no
difference between the fairness of pgmcc and TCP (note that 90% confidence
intervals overlap for all points in the three graphs).

Note: the throughput graph is the result of averaging the throughput of the last
400 seconds of simulations over 10 runs (90% confidence intervals are shown). The
simulation duration is 500 seconds. TCP flows are started at random times,
uniformly distributed between 0 and 10 seconds.

Fig. 3. Curve “link L2” shows the throughput of a pgmcc session running only on link
L2. Curves “tcp1”, “tcp2” and “tcp3” show the average throughput of the TCP flows
competing on link L1 with the pgmcc session with receivers on L1 and L2.

Fig. 4. Average fairness index. Curve “tcp” refers to a scenario where only 4 TCP
flows compete for bandwidth, while curves “pgmcc” refer to a scenario with a pgmcc
session and a propagation delay over link L2 of 10, 100 and 130 ms.

These results show that pgmcc is as fair as TCP in situations where some 
receivers experience a round trip time which is much smaller than others, and 
also when two receivers have a similar throughput characterization that could 
lead to frequent acker switches, a potential source of unfairness. 

5.2 Networks with High Packet Loss Rates 

In this section we study the impact of a high loss rate path on the fairness of 
pgmcc. Due to the use of the simplified TCP formula, the sender may overes- 
timate the expected throughput of a receiver behind a link with a very high 
loss rate, and elect as the acker a receiver along a different path resulting in an 
overall unfair behaviour. 

To simulate such network conditions we use the simple topology shown in
Figure 5. L1 is a high capacity link (15 Mbps, 50 ms delay, FIFO queue with
100 slots) that carries a high volume of bursty traffic (Web-like traffic). Link L2,
instead, has a very limited capacity (400 Kbps, 50 ms delay, FIFO queue with 20
slots) but only a few connections competing for bandwidth.




Fig. 5. Link L1 is a high volume link with 15 Mbps capacity that experiences very high
loss rates, while link L2 has a relatively small capacity of 400 Kbps. Both links have a
propagation delay of 50 ms and a FIFO queue with a buffer size of 100 packets.



On link L1, a long-lived TCP session and a pgmcc session compete for bandwidth
with a large amount of web-like cross traffic. We simulated web-like traffic with
several ON/OFF UDP sources, with ON and OFF times drawn from a Pareto
distribution. The mean ON time is 1 s, the mean OFF time is 2 s, and during
ON time, the UDP sending rate is 500 Kbps. On link L2 instead, only one TCP
session competes with the multicast session.
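
The mean load offered by the background traffic follows directly from these parameters. The Python sketch below is a back-of-the-envelope estimate that uses only the mean ON and OFF times and ignores the burstiness introduced by the Pareto distribution; the function name is hypothetical.

```python
def mean_offered_load_bps(n_sources, peak_bps, mean_on_s, mean_off_s):
    """Long-run average rate of n independent ON/OFF sources."""
    duty_cycle = mean_on_s / (mean_on_s + mean_off_s)
    return n_sources * peak_bps * duty_cycle


# 90 sources, 500 Kbps while ON, mean ON time 1 s, mean OFF time 2 s.
load_90 = mean_offered_load_bps(90, 500_000, 1.0, 2.0)
```

Each source contributes about 167 Kbps on average, so with roughly 90 sources the mean offered load alone matches the 15 Mbps capacity of link L1.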

We vary the number of UDP sources to simulate different traffic loads on link
L1, resulting in a loss rate ranging from 0 to about 25%. Figure 6 shows the
loss rate at the bottleneck router as a function of the number of UDP sources,
when only the web-like traffic is injected into the network.




Fig. 6. Loss rate at the bottleneck in presence of ON/OFF background traffic. 



At first, we are interested in measuring the fairness of pgmcc in the presence of
high loss rates. Therefore we run 10 simulations with a set of 10 pgmcc receivers
behind link L1 competing with a TCP connection and the UDP sources. Each
simulation lasts for 2000 s.

Figure 7 shows the average throughput achieved by the multicast session and
the TCP connection (90% confidence intervals are shown). As we can see from
the graph, pgmcc and TCP achieve the same average throughput in the presence of
highly variable background traffic.

A second set of simulations has been used to verify the fairness of the protocol 
in presence of different paths with very different loss rates. In particular our goal 
is to verify whether the use of the simplified TCP formula may lead to unfairness 
due to errors in the estimate of the expected throughput. 

In this set of simulations, 10 receivers behind link L2 join the pgmcc session.
On link L2, pgmcc competes with a TCP session. In Figure 8 we compare the
average throughput of the pgmcc session and of the two TCP sessions as a
function of the number of UDP sources (90% confidence intervals are also shown).

The graph in Figure 8 shows that the presence of a path with a very high
loss rate has a non-negligible impact on pgmcc fairness.

Even when the average loss rate on link L1 is around 5% (for about 90
UDP sources), the throughput of pgmcc is below its fair share of bandwidth. This
is due to the bursty nature of the background web traffic, which causes transient
but heavy congestion events on link L1. As a direct consequence, the sender may
elect as the acker one of the receivers behind L1, and then reduce its sending
rate for the entire duration of the congestion event.

On the other hand, when the loss rate increases (above 15%), pgmcc is slightly
more aggressive than TCP. This behavior is mainly due to two reasons:

– pgmcc and TCP congestion control algorithms substantially differ on the
slow-start mechanism and the retransmit timeout. While TCP adapts the slow
start threshold and uses an exponential backoff for the retransmit timeout,
pgmcc makes use of a fixed value both for the slow start threshold and for
the retransmit timeout.

– in case of two successive retransmit timeouts, the pgmcc sender may elect
a different receiver as acker. In this scenario, a long congestion event (one that
lasts more than twice the fixed retransmit timeout) on link L1 may force the
sender to elect as acker a receiver on link L2 and to increase its sending rate,
resulting in an unfair allocation of bandwidth on link L1.

We believe that further investigation is needed to better understand the dy- 
namics of the pgmcc scheme in high loss scenarios. However one should keep in 
mind that such experiments are highly dependent on a number of details such 
as network parameters (delays, buffer sizes), characteristics of competing traffic 
(burstiness, especially), and on the choice of protocol’s parameters (such as time- 
outs and slow start thresholds) which also make a significant difference among 
different TCP flavours. As a consequence, the focus of these experiments should 
be on verifying the safe behaviour of the protocol (i.e. the fact that throughput 
decreases with increasing congestion, and that instances of different protocols do 
not starve each other), and not on moderate differences in the absolute through- 
put. 



5.3 Uncorrelated Losses 

An important aspect of the design of single-rate multicast congestion control 
schemes is the behaviour in presence of uncorrelated losses. Depending on how 
loss reports are handled, the source might assume an overall loss rate for the 
session much higher than the loss rate of each individual receiver. 

To get some indications on how pgmcc works in the presence of independent
losses with up to 100 receivers, we ran a simulation on the topology of Figure 9,
where a pgmcc source initiates a multicast session with 100 receivers behind
independent lossy links with 1% packet loss rate. An additional link with the same
characteristics is used for a TCP flow, in order to compare the performance of the
congestion control schemes.

At time 0, the TCP session and 10 pgmcc receivers are started. At time 300 s,
90 more pgmcc receivers join the session. Figure 10 shows the throughput of
the TCP connection and of the multicast session over time. As shown in the
graph, the arrival of the 90 additional receivers at time 300 s does not appreciably
influence the throughput of the multicast session.

Much larger scale tests are certainly useful to investigate this behaviour in
more detail. However, such tests cannot be run with simple retransmission-based
repairs, or the repair traffic would quickly dominate the actual data traffic on
the link from the source.

Note: the pgmcc sender may elect a different acker after two successive retransmit
timeouts in order to avoid the case where the current acker leaves the multicast
session and the sender is not able to elect a new acker and to transmit any new
packet, due to the absence of acknowledgments [?].

Fig. 9. A topology with 100 independent links with 1% probability of packet drop. A
TCP connection is run over another link with the same characteristics.

Fig. 10. Impact on throughput of multiple receivers with uncorrelated losses. Initially
10 pgmcc receivers and one TCP, then, after 300 seconds, 90 more receivers join the
multicast session.



6 Conclusion 

In this paper we have discussed the performance of a recently proposed single- 
rate multicast congestion control scheme called pgmcc. We mainly addressed the 
problem of fairness and TCP friendliness of the new scheme. 

We have identified a set of basic scenarios where fairness is a concern, and by
simulations we have shown that pgmcc is capable of sharing available bandwidth
fairly with competing TCP traffic. In particular, pgmcc behaves in a way very
similar to TCP in a wide variety of network conditions, with very high loss rates
or a high degree of heterogeneity among multicast receivers in terms of round
trip times and packet loss rates.

We believe that more complex scenarios can be built starting from the basic
configurations we have analyzed. We are very confident that the performance of
pgmcc will not significantly differ in such scenarios.

The strength of pgmcc relies on its simple design. Two main factors make this 
scheme scalable and stable: i) the fast election process that permits the sender to 
quickly identify the group’s worst receiver, and ii) the use of a closed control loop 
between the sender and the group’s representative that mimics TCP congestion 
control. 

More extensive experiments certainly need to be made, but we believe that
pgmcc is mature enough to be deployed on a real operational network. Therefore,
future work will be devoted to specifying pgmcc features and requirements in the
context of the IETF Reliable Multicast Transport Group and to implementing this
scheme on the most common operating systems.



Abstract. We consider the problem of designing decentralized algo-
rithms to achieve max-min fairness in multicast networks. Starting with
a convex program formulation, we show that there exists an algorithm
to compute the fairshare at each link using only the total arrival rate at
each link. Further this information can be conveyed to the multicast re-
ceivers using a simple marking mechanism. The mechanisms required to
implement the algorithm at the routers have low complexity since they
do not require any per-flow information.



1 Introduction 

Multicast shows great promise as a network service for providing efficient content 
delivery from one sender to many receivers. However, its widespread deployment 
depends critically on the development of practical congestion control algorithms. 
One of the key challenges in developing such algorithms is to handle the hetero- 
geneity that often characterizes multicast sessions. A single session can include 
many receivers with widely varying bandwidth connectivities to the sender. In
cases where receivers are required to all receive data at the same rate, the sender
may choose a very low data rate in order to support the smallest reception rate.
On the other hand, where fully reliable reception is not required and receivers
have the flexibility to receive a subset of the data, it is often possible for each
receiver to select a rate consistent with its network connectivity. We shall refer
to the former as single-rate sessions and the latter as multirate sessions.

It was established in [?] in the context of max-min fairness that multirate
reception enhances network fairness properties. Hence we focus on the problem
of congestion control for multirate sessions. Moreover we focus on algorithms
that produce max-min rate allocations. In particular, using ideas from [?],
we formalize the problem of max-min multirate congestion control as a convex
optimization problem and develop simple algorithms for its solution.

There is a tremendous amount of literature on optimization-based congestion
control, e.g., [?], that formulates unicast congestion control problems as
convex programs and derives congestion controllers that converge to the optimal
solution of the convex programs. However, as we will show later, the multirate,
multicast congestion control problem introduces certain constraints that are not
easily incorporated in the framework of these earlier papers. While, in general,
the convex program formulation does not lead to implementable solutions for
the multirate multicast problem, we show that there exists a particular choice of
utility functions that approximates max-min fair allocation arbitrarily closely and
also provides implementable solutions. These algorithms are easily implemented
at the receivers and require only that the network elements provide a packet
marking capability. Such a capability is currently under investigation for the
future Internet [?].

A number of multirate reception protocols have been proposed for the
reception of streamed layered video, e.g. RLM (receiver-driven layered multicast),
TCP-layered multicast [?], and LVMR (layered video multicast with
retransmissions) [?]. However, none of these deal with the problem of providing
inter-session fairness. In addition, there have been several studies of max-min
fairness in the context of multicast. Tzeng and Siu [?] first proposed its use in
the context of single rate multicast sessions and presented an algorithm in the
context of ATM for obtaining a max-min rate allocation in a static environment.
More recently, Sarkar and Tassiulas [?] presented and analyzed an algorithm
for obtaining the rate allocation in a multirate multicast network. Unlike our
proposed algorithm, which only requires a packet marking capability (currently
under investigation for the Internet), their algorithm requires storage of per-
session state information for each link in the network. Hence it appears better
suited for an ATM network architecture rather than the current Internet.

The rest of the paper is organized as follows. In Section II, we first formulate
the max-min fair rate allocation problem for networks with multirate, multicast
flows as a convex program. We then show that there exists a simple, distributed
algorithm that converges to the max-min fair rate allocation. This algorithm
requires the network routers to use only the total arrival rate at each link.
In Section III, we present our algorithm in detail and study its convergence
properties in Section IV. Simulation results are presented in Section V, and
concluding remarks are provided in Section VI.

2 Motivation for Using a Convex Program Formulation

Consider the following convex program:




( 1 )
subject to




( 2 ) 



( 3 ) 



where L is the set of links, C_l is the capacity of link l, S is the set of sessions,
S_l is the set of sessions on link l, V(s) is the set of virtual sessions in Session s,
L_sr is the route (a contiguous collection of links) of Virtual Session r of Session
s and x_sr is the transmission rate of Virtual Session r of Session s.

The above convex program corresponds to a resource allocation problem
where the objective is to maximize the sum of the utilities of all the users in
the network, where each user (or virtual session) has a utility function given by
the corresponding term of the objective (1). The max that appears in constraint (2)
reflects the property that the
bandwidth used by a multicast session on a link equals the maximum bandwidth
required by any virtual session associated with the session that utilizes that link.

For ease of presentation, we assume that the weights w_s are the same for all
virtual sessions in a multicast session although one can allow this to be more
general. Note that the above utility function is a special case of the functions
considered in [?]. As n → ∞, this leads to a max-min fair allocation.

To simplify the notation, we consider the example network shown in Figure 1.
There are three sessions, 0, 1 and 2. Session 0 has two receivers corresponding
to two virtual sessions whereas Sessions 1 and 2 are unicast sessions.

Fig. 1. A Y-network with three sessions: a multicast session with two receivers and
two unicast sessions

The capacity constraints become

    \max\{x_{01}, x_{02}\} + x_1 + x_2 \le C_A,

    x_{01} + x_1 \le C_B,        (4)

    x_{02} + x_2 \le C_C.        (5)

It is more convenient to write the constraint on Link A as two linear constraints
as follows:

    x_{01} + x_1 + x_2 \le C_A,        (6)

and

    x_{02} + x_1 + x_2 \le C_A.        (7)



Now this problem is in the form of a standard convex program subject to
linear constraints. As in [?], this can be solved using duality theory as follows:

    p_{Ai}(k+1) = \left( p_{Ai}(k) + \epsilon \left( x_{0i}(k) + x_1(k) + x_2(k) - C_A \right) \right)^+, \quad i = 1, 2,

    p_B(k+1) = \left( p_B(k) + \epsilon \left( x_{01}(k) + x_1(k) - C_B \right) \right)^+,

and

    p_C(k+1) = \left( p_C(k) + \epsilon \left( x_{02}(k) + x_2(k) - C_C \right) \right)^+,

where the rates x_i(k) are calculated as

    \left( \frac{w_0}{x_{01}(k)} \right)^n = p_{A1}(k) + p_B(k),

    \left( \frac{w_0}{x_{02}(k)} \right)^n = p_{A2}(k) + p_C(k),

    \left( \frac{w_1}{x_1(k)} \right)^n = p_{A1}(k) + p_{A2}(k) + p_B(k),

    \left( \frac{w_2}{x_2(k)} \right)^n = p_{A1}(k) + p_{A2}(k) + p_C(k).

In the above equations, p_A1(k), p_A2(k), p_B(k) and p_C(k) are the estimates (at
time k) of the shadow prices (or Lagrange multipliers) corresponding to (6), (7), (4)
and (5), respectively.
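
One iteration of this dual algorithm for the Y-network can be sketched in Python as follows. This is a hedged illustration, not the authors' code: it assumes that each rate solves (w / x)^n = (sum of the shadow prices on its path), and the cap x_max applied when a price sum is zero, the dictionary keys, and the function names are all hypothetical choices made for the example.

```python
def project(value):
    """(.)^+ : projection onto the non-negative reals."""
    return max(value, 0.0)


def rate(weight, price_sum, n, x_max):
    """Rate solving (weight / x)**n = price_sum, capped at x_max."""
    if price_sum <= 0.0:
        return x_max  # zero price: the rate is only limited by the cap
    return min(weight / price_sum ** (1.0 / n), x_max)


def dual_step(p, cap, w, n, eps, x_max):
    """Compute rates from the current prices, then update the prices."""
    x01 = rate(w[0], p["A1"] + p["B"], n, x_max)
    x02 = rate(w[0], p["A2"] + p["C"], n, x_max)
    x1 = rate(w[1], p["A1"] + p["A2"] + p["B"], n, x_max)
    x2 = rate(w[2], p["A1"] + p["A2"] + p["C"], n, x_max)
    new_p = {
        "A1": project(p["A1"] + eps * (x01 + x1 + x2 - cap["A"])),
        "A2": project(p["A2"] + eps * (x02 + x1 + x2 - cap["A"])),
        "B": project(p["B"] + eps * (x01 + x1 - cap["B"])),
        "C": project(p["C"] + eps * (x02 + x2 - cap["C"])),
    }
    return new_p, (x01, x02, x1, x2)


prices = {"A1": 1.0, "A2": 1.0, "B": 1.0, "C": 1.0}
capacity = {"A": 10.0, "B": 10.0, "C": 10.0}
prices, rates = dual_step(prices, capacity, [1.0, 1.0, 1.0], n=2, eps=0.1, x_max=100.0)
```

Note that link A must be given both x01 and x02 separately to update its two prices, which is exactly the information a router does not have in practice.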

We introduce the notation x* and p* to denote the optimal values of the
virtual session rates and shadow prices, respectively. For the purpose of this
example, we will assume that the optimal solution to the convex program satisfies
x*_02 < x*_01. This implies that (7) is an inactive constraint, i.e., one that would be
satisfied with strict inequality at the optimal solution.

While the above dual algorithm will converge, it is impossible to implement.
Specifically, there are two issues:

– Link A needs the values of x_01 and x_02 to compute its shadow prices p_A1
and p_A2; however, only max{x_01, x_02} is available. Since we have assumed
that (7) is inactive, we know from standard convex optimization theory that
the optimal value p*_A2 is equal to zero. Therefore, it may be sufficient to
only calculate a single shadow price p_A for link A using the iteration

    p_A(k+1) = \left( p_A(k) + \epsilon \left( \max_i \{ x_{0i}(k) \} + x_1(k) + x_2(k) - C_A \right) \right)^+.

At least locally (i.e., near the optimal solution), the iteration for p_A will be
the same as that for p_A1 and thus, p_A(k) will converge to p*_A1 if the iteration
is started in a local neighborhood of p*_A1.






E.E. Graves, R. Srikant, and D. Towsley 



– Each virtual session has to know the sum of the shadow prices along its
path. Suppose that $p_{A2}$ is equal to zero in steady state; then Virtual Session
2 of Multicast Session 0 needs to know only $p_C$. On the other hand, Virtual
Session 1 needs to know $p_A + p_B$. However, when per-flow state is not
maintained, it is easiest to convey a single quantity for each link in the network.
Thus, if a single $p_A$ is computed for link A and as in dD, the sum of the
shadow prices on each path is conveyed to the edges of the network, then
Virtual Session 1 would receive $p_A + p_B$, which is the correct information.
But the sum of $p_A$ and $p_C$ would be conveyed to Virtual Session 2, which is
incorrect. We address this problem below.

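Defining $\lambda_l = p_l^{1/n}$, we obtain

$$\frac{w_0^n}{x_{01}^n(k)} = \lambda_{A1}^n(k) + \lambda_B^n(k), \qquad (8)$$

$$\frac{w_0^n}{x_{02}^n(k)} = \lambda_{A2}^n(k) + \lambda_C^n(k), \qquad (9)$$

$$\frac{w_1^n}{x_1^n(k)} = \lambda_{A1}^n(k) + \lambda_{A2}^n(k) + \lambda_B^n(k), \qquad (10)$$

$$\frac{w_2^n}{x_2^n(k)} = \lambda_{A1}^n(k) + \lambda_{A2}^n(k) + \lambda_C^n(k). \qquad (11)$$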

We note that, as $n \to \infty$, $\left(\sum_i p_i^n\right)^{1/n}$ approximates $\max_i p_i$ (provided that the
max is achieved by only one $p_i$). Thus, in the limit, each source simply needs to
know the maximum of the scaled shadow prices ($\lambda$'s). We will now consider the
implication of computing only $\lambda_A$ and using it in lieu of $\lambda_{A1}$ and $\lambda_{A2}$ as $n \to \infty$:

– Equation (8) becomes

$$\frac{w_0}{x_{01}^*} = \max\{\lambda_A^*, \lambda_B^*\}.$$

Since we already argued that $\lambda_A^* = \lambda_{A1}^*$ in steady-state, this is consistent
with the solution obtained from the dual of the convex program.

– Equation (9) yields

$$\frac{w_0}{x_{02}^*} = \max\{\lambda_A^*, \lambda_C^*\}.$$

This equals $\max\{\lambda_{A2}^*, \lambda_C^*\}$ only if $\lambda_A^* \le \lambda_C^*$ since $\lambda_{A2}^* = 0$. This is indeed
true since by our assumption $x_{02}^* < x_{01}^*$.

– Since $\lambda_{A2}^* = 0$, from (10) and (11), it is easy to see that the optimal solutions
of $x_1$ and $x_2$ are unaffected by using just $\lambda_A$. For example, at the optimal
solution,

$$\frac{w_1}{x_1^*} = \max\{\lambda_{A1}^*, \lambda_{A2}^*, \lambda_B^*\} = \max\{\lambda_A^*, \lambda_B^*\}.$$
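
The limiting step used in the bullets above, replacing an n-th root of a sum of n-th powers by a maximum, can be checked numerically. The following is a minimal sketch; the function name `p_norm_max` and the sample values are illustrative assumptions, not part of the paper.

```python
def p_norm_max(values, n):
    """Return (sum_i v_i**n) ** (1/n), which tends to max(values) as n grows."""
    top = max(values)
    if top == 0:
        return 0.0
    # Factor out the largest term so that large n neither overflows nor underflows.
    return top * sum((v / top) ** n for v in values) ** (1.0 / n)


# Hypothetical scaled shadow prices along one path (one of them inactive).
scaled_prices = [0.2666, 0.4, 0.0]
for n in (1, 5, 50):
    print(n, p_norm_max(scaled_prices, n))
```

For n = 1 the expression is simply the sum of the values; as n grows it approaches the largest value, which is why each source only needs the maximum scaled shadow price in the limit.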

Thus, the above example suggests that in the case of this particular utility 
function, it may be sufficient to only use the aggregate flow into a link to compute 
the max-min fair solution. In the next section, we present our algorithm as 
motivated by the above discussion. We show that the equilibrium point of the
algorithm is indeed the max-min fair solution and provide conditions for the 
local convergence of the synchronous version of this algorithm. We also present 
a simple marking mechanism to implement the algorithm and finally present 
simulation results illustrating the global convergence of the algorithm even with 
asynchronous updates. 



Example: We consider the network in Figure 1 to illustrate the scaling for the
shadow price and show that the scaled shadow price converges to the inverse of
the fairshare (the fairshare is called the link control parameter in M) as
$n \to \infty$. We let $w_0 = w_1 = w_2 = 1$, $c_A = 10$, $c_B = 15$ and $c_C = 5$. For this
example, it is easy to calculate the max-min fair rates as

$$x_{01} = 3.75, \quad x_{02} = 2.5, \quad x_1 = 3.75, \quad x_2 = 2.5.$$

Alternately, we calculate the solution to the nonlinear program for integer
values of n from 1 to 10, and plot the resulting values of the scaled shadow prices
in Figures 2 and 3. Since $x_{02} + x_1 + x_2 < c_A$ and $x_{01} + x_1 < c_B$, for all values
of n, the scaled shadow prices $\lambda_{A2}^*$ and $\lambda_B^*$ are always equal to zero and therefore
are not plotted.



Fig. 2. Scaled shadow price $\lambda_{A1}^*$ for the Y-network as a function of
the utility function parameter n.

Fig. 3. Scaled shadow price $\lambda_C^*$ for the Y-network as a function of the
utility function parameter n.



From the figures, we see that $\lambda_{A1}^*$ converges to 0.2666 and $\lambda_C^*$ converges to
0.4. Using the algorithm in the reference cited above, the fairshare on Link A is
3.75, which is equal to $1/\lambda_{A1}^*$, and the fairshare on Link C is 2.5, which is equal
to $1/\lambda_C^*$. Since Link B is underutilized, the fairshare on that link can be taken
to be any value greater than or equal to $c_B$ and the optimization algorithm yields
a value of $\infty$. □
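
The numbers in this example can be checked directly. The sketch below is an illustration only: it takes the stated rates and capacities, computes the load on each link (counting the multicast session once on Link A, at the larger of its two virtual session rates), and compares the inverse rates with the limits read off the figures. The variable and function names are assumptions made for this sketch.

```python
# Stated max-min fair rates for the Y-network example (all weights equal to 1).
rates = {"x01": 3.75, "x02": 2.5, "x1": 3.75, "x2": 2.5}
capacity = {"A": 10.0, "B": 15.0, "C": 5.0}


def link_loads(r):
    """Load per link; the multicast session counts once on Link A, at its larger rate."""
    return {
        "A": max(r["x01"], r["x02"]) + r["x1"] + r["x2"],
        "B": r["x01"] + r["x1"],
        "C": r["x02"] + r["x2"],
    }


loads = link_loads(rates)
# Links whose load equals their capacity are the candidates for bottlenecks.
saturated = {l for l in capacity if abs(loads[l] - capacity[l]) < 1e-9}
# Inverse fairshares of the saturated links, to compare with the plotted limits.
inv_fairshare = {"A": 1 / rates["x01"], "C": 1 / rates["x02"]}
print(loads, saturated, inv_fairshare)
```

Links A and C come out saturated while Link B carries 7.5 out of 15, which matches the statement that Link B is underutilized.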

Remark 1. We note that the actual shadow prices themselves (at links A and
C) go to $\infty$ as $n \to \infty$ and thus, do not provide insight into the max-min fair
rate allocation. It is therefore important to scale the shadow prices as
$\lambda_l = p_l^{1/n}$ to obtain meaningful results. □
3 Weighted Max-min Fairness Algorithm

We present an algorithm motivated by the discussions of the convex program 
formulation of the problem discussed in the previous section. In the algorithm, 
we compute the scaled shadow price which, as already discussed, corresponds to 
the inverse of the fairshare. 

The basic steps of the algorithm are summarized as follows: 

1. Compute the scaled shadow price: As dictated by the gradient-descent
iteration of the dual problem, the shadow price ($\lambda_l$ for each link $l$) is increased or
decreased depending upon whether the link is overutilized or underutilized,
respectively. The fairshare for a link is the inverse of the shadow price.

2. Compute the allowable rate for each virtual session: Each virtual session is 
allowed to transmit at the minimum fairshare for the set of links that it 
traverses. This is the maximum allowable fairshare for that virtual session. 

3. Compute the load at each link: The load on each link is computed as the 
total arrival rate at the link. When multiple virtual sessions from the same 
session pass through a link, the maximum of their rates is used as the arrival 
rate for the multicast session. 

The above steps are repeated resulting in all virtual sessions being bottle- 
necked at max-min fairness. 

Before we present the algorithm, we introduce the following terminology: 

$c_l$: capacity of link $l$

$\lambda_l(k)$: scaled shadow price of link $l$ at time $k$

$f_l(k)$: fairshare at link $l$ at time $k$

$x_s(k)$: maximum allowable rate for session $s$ at time $k$

$l_l(k)$: load on link $l$ at time $k$

$\epsilon$: scaling factor

$L_v$: set of links traversed by virtual session $v$

$S_l$: set of sessions traversing link $l$

$V_{sl}$: set of virtual sessions corresponding to session $s$ on link $l$

The algorithm can now be described as follows:

– $\lambda_l(0)$ is chosen at random for each $l$ in the network such that $1/c_l \le \lambda_l(0)$.

– At each step

1. Compute the shadow price for each link $l$ in the network

$$\lambda_l(k+1) = \left[\lambda_l(k) + \epsilon\,(l_l(k) - c_l)\right]_m^M,$$

where $[y]_m^M$ denotes $\min\{\max\{y, m\}, M\}$. Depending upon the choice of
the stepsize, the convergence properties of the algorithm may vary. Here
we address only the situation where $\epsilon$ is a constant.

2. Compute the fairshare of link $l$ at time instant $k$, $f_l(k) = 1/\lambda_l(k)$.

3. Compute the allowable rate for each virtual session $v$ in the network

$$x_v(k+1) = w_v \min_{l \in L_v} f_l(k+1) = \frac{w_v}{\max_{l \in L_v} \lambda_l(k+1)}.$$

By approximating the maximum by a sum, we can implement the
computation of the maximum by a simple marking algorithm using the ideas
in mu. We will elaborate upon this later.

4. Compute the load on each link as the sum of the rates of each session
across that link,

$$l_l(k+1) = \sum_{s \in S_l} \max_{v \in V_{sl}} x_v(k+1).$$

Note that the rate for Session $s$ traversing link $l$ is the maximum of the
rates of the virtual sessions of Session $s$ traversing the link.
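
The four steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `step`, the dictionary layout, the lower bound m = 1/c_l and the upper bound M = 1/f_min (with a hypothetical f_min) are choices made here. The data describe a Y-network in which multicast Session 0 has virtual sessions v01 (links A, B) and v02 (links A, C), and unicast Sessions 1 and 2 use links A, B and A, C respectively.

```python
def step(lam, cap, routes, weights, session_of, eps=0.01, f_min=0.3):
    """One synchronous iteration: prices -> fairshares -> rates -> loads -> new prices."""
    fair = {l: 1.0 / lam[l] for l in lam}                      # fairshare = 1 / price
    rates = {v: weights[v] * min(fair[l] for l in path)        # weighted minimum fairshare
             for v, path in routes.items()}
    loads = {}
    for l in lam:
        per_session = {}  # a multicast session counts once, at its largest rate
        for v, path in routes.items():
            if l in path:
                s = session_of[v]
                per_session[s] = max(per_session.get(s, 0.0), rates[v])
        loads[l] = sum(per_session.values())
    # Projected price update; the bounds 1/c_l and 1/f_min are assumptions of this sketch.
    new_lam = {
        l: min(max(lam[l] + eps * (loads[l] - cap[l]), 1.0 / cap[l]), 1.0 / f_min)
        for l in lam
    }
    return new_lam, rates, loads


cap = {"A": 30.0, "B": 20.0, "C": 10.0}
routes = {"v01": ("A", "B"), "v02": ("A", "C"), "v1": ("A", "B"), "v2": ("A", "C")}
weights = {"v01": 1.0, "v02": 1.0, "v1": 2.0, "v2": 3.0}
session_of = {"v01": 0, "v02": 0, "v1": 1, "v2": 2}

lam = {l: 1.0 / c for l, c in cap.items()}  # start with fairshare equal to capacity
for _ in range(5):
    lam, rates, loads = step(lam, cap, routes, weights, session_of)
    print(rates, loads)
```

Starting from fairshares equal to the capacities, every link is overloaded in the first iteration, so all prices rise and the allowable rates fall. The sketch makes no claim about global convergence; it only shows how the quantities feed into one another.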

3.1 Max-min Fairness of the Equilibrium Point 

We now show that, if the algorithm converges to an equilibrium solution, the
resulting rate allocation is max-min fair. Max-min fairness as defined in m is
obtained iff every virtual session has a bottleneck link. By definition, for a link $l$
to be bottlenecked with respect to a virtual session $v$ it must satisfy the following
conditions.

1. $l_l = c_l$, i.e., the bandwidth allocated to the sessions traversing link $l$ must
equal the link's capacity.

2. $x_v^* \ge x_r^*$ for all $r \in S_l$.

Lemma 1. The rate allocation at the equilibrium point of the algorithm is
max-min fair.

Proof: We claim that the route of each virtual session contains a bottleneck link.
To show this, assume that a Virtual Session $v$ does not have a bottleneck link.
Suppose no link in $L_v$ satisfies condition (1). Then, from the equation

$$\lambda_l(k+1) = \left[\lambda_l(k) + \epsilon\,(l_l - c_l)\right]_m^M$$

it is clear that the link will be underutilized, i.e., $l_l < c_l$. Further, the equilibrium
value of the scaled shadow price $\lambda_l$ will be equal to $1/c_l$. Since $x_v = \min_{l \in L_v} 1/\lambda_l$,
the equilibrium value of $x_v$ would be equal to $c_l$ for some $l \in L_v$, which contradicts
the fact that all links in the path $L_v$ are underutilized. Thus, there exists a
nonempty subset $L'_v \subseteq L_v$ such that all links in $L'_v$ satisfy condition (1).

Next suppose that all links in $L'_v$ violate condition (2). Now, consider a
link $l \in L'_v$. Given $x_v < x_r$ for all $r \in S_l$, $r \ne v$, and by definition of
$x_r = \min_{m \in L_r} 1/\lambda_m$, it follows that $x_v < 1/\lambda_l$. Again,
$x_v = \min_{m \in L_v} 1/\lambda_m = 1/\lambda_j < 1/\lambda_l$ for some link $j \in L_v$. Suppose that
$j \notin L'_v$; then it would imply that $\lambda_j = 1/c_j$, which implies that $x_v = c_j$ and
thus, $l_j = c_j$, which is a contradiction. Therefore, $j \in L'_v$. This must be true for
all links $l \in L'_v$. However, for the case where $j = l$, we get $1/\lambda_l < 1/\lambda_l$, which is
contradictory, so the link cannot violate condition (2). □
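
The two bottleneck conditions can be tested mechanically for a candidate allocation. The sketch below is an illustration under assumptions: the function name `has_bottleneck` is hypothetical, and condition (2) is applied to weight-normalized rates (rate divided by weight), which is one natural reading for the weighted case. Exact fractions are used so that the equality in condition (1) can be tested without rounding. The data are the Y-network rates 6 2/3, 2 1/2, 13 1/3 and 7 1/2 with weights 1, 1, 2, 3 and capacities 30, 20, 10.

```python
from fractions import Fraction as F


def has_bottleneck(v, rates, weights, routes, cap, session_of):
    """True if some link on v's route satisfies both bottleneck conditions."""
    for l in routes[v]:
        on_link = [u for u in routes if l in routes[u]]
        # Condition 1: the link is fully used (a multicast session counts once, at its max rate).
        per_session = {}
        for u in on_link:
            s = session_of[u]
            per_session[s] = max(per_session.get(s, 0), rates[u])
        full = sum(per_session.values()) == cap[l]
        # Condition 2 (weight-normalized, an assumption): nobody on the link gets a larger share.
        top = all(rates[v] / weights[v] >= rates[u] / weights[u] for u in on_link)
        if full and top:
            return True
    return False


routes = {"v01": ("A", "B"), "v02": ("A", "C"), "v1": ("A", "B"), "v2": ("A", "C")}
session_of = {"v01": 0, "v02": 0, "v1": 1, "v2": 2}
weights = {"v01": 1, "v02": 1, "v1": 2, "v2": 3}
cap = {"A": 30, "B": 20, "C": 10}
rates = {"v01": F(20, 3), "v02": F(5, 2), "v1": F(40, 3), "v2": F(15, 2)}

print({v: has_bottleneck(v, rates, weights, routes, cap, session_of) for v in routes})
```

With these rates every virtual session has a bottleneck (Link B for v01 and v1, Link C for v02 and v2), while Link A is not saturated and therefore is nobody's bottleneck.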
3.2 Implementation Using One-Bit Packet Marking

Our algorithm requires the network to convey $\min_{l \in L_v} f_l(k)$ to each Virtual
Session $v$. This can be accomplished using a simple marking scheme. First,
motivated by the solution to the convex program in the previous section, we use the
following approximation:

$$\min_{l \in L_v} f_l(k) \approx \frac{1}{\left(\sum_{l \in L_v} 1/(f_l(k))^n\right)^{1/n}},$$

where n is some large number. Thus, we can equivalently think of the information
to be conveyed to each virtual session to be $\sum_{l \in L_v} 1/(f_l(k))^n$. Now, combining this
with the marking scheme referenced earlier, we obtain the following algorithm to
convey the fairshare to the virtual sessions:

– Each link $l$ marks every packet that traverses it with probability
$(1 - e^{-1/(f_l(k))^n})$.

– Each virtual session $v$ keeps track of the rate at which it receives unmarked
packets. From the marking scheme described above, the rate of unmarked
packets for a virtual session $v$ is

$$\exp\Big(-\sum_{l \in L_v} 1/(f_l(k))^n\Big).$$

Thus, the virtual session can obtain an estimate of the minimum fairshare of
the links in its path as $1/(-\ln M_v(k))^{1/n}$, where $M_v(k)$ is the current estimate
of the rate at which it receives unmarked packets.

While the above algorithm would, in principle, provide the required
information to the sources, there is a serious implementation difficulty with this
approach. For large values of n, depending upon the value of $1/f_l(k)$, $1/(f_l(k))^n$
could either be very large or very close to 0. As a result, $e^{-1/(f_l(k))^n}$
could be very close to 0 or 1. This leads to the numerically unstable condition
where all packets are always marked or none of the packets are marked over long
intervals of time.
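
The difficulty is easy to reproduce numerically. The sketch below is an idealized illustration (it works with exact marking probabilities rather than sampled packets); the function names are assumptions made here. It shows both that the estimate of the minimum fairshare is accurate when the fairshares are near 1, and that the marking probability saturates at 1 or collapses toward 0 when they are not.

```python
import math


def mark_probability(fairshare, n):
    """Per-link marking probability 1 - exp(-1 / fairshare**n)."""
    return -math.expm1(-1.0 / fairshare ** n)


def estimate_min_fairshare(fairshares, n):
    """Invert the end-to-end unmarked fraction (idealized, no sampling noise)."""
    unmarked = math.prod(1.0 - mark_probability(f, n) for f in fairshares)
    return 1.0 / (-math.log(unmarked)) ** (1.0 / n)


n = 50
print(estimate_min_fairshare([2.0, 1.0, 1.5], n))  # close to the true minimum, 1.0
print(mark_probability(0.5, n))                    # saturates: every packet is marked
print(mark_probability(2.0, n))                    # vanishes: essentially no packet is marked
```

A link with fairshare 0.5 marks every packet and a link with fairshare 2 marks almost none, so a receiver counting marks over any realistic window learns nothing about either value.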

To overcome this problem, we consider the following transformation:

$$\delta_l(k) = \log_b\left(\frac{K}{f_l(k)}\right).$$

The idea behind the above transformation is that if K and b are chosen so that
$\delta_l(k)$ remains close to one, then one can convey $\max_{l \in L_v} \delta_l(k)$ to the sources
using the marking probability

$$1 - e^{-(\delta_l(k))^n},$$

thus avoiding numerical instability.
To find appropriate values for b and K, we assume that the fairshares at each
link are constrained to lie in an interval $[f_{min}, f_{max}]$. The upper bound $f_{max}$ could
be the link capacity and the lower bound could be a minimum rate guaranteed
to all sources. Now suppose that we want $\delta_l(k) \in [x, y]$ where $x < 1 < y$. For
example, x could be 0.95 and y could be 1.05. Thus, we require

$$x \le \log_b\left(\frac{K}{f_l(k)}\right) \le y,$$

which is satisfied if

$$\log_b(K/f_{min}) = \log_b(K\lambda_{max}) = y,$$

and

$$\log_b(K/f_{max}) = \log_b(K\lambda_{min}) = x.$$

Solving these yields

$$b = \left(\frac{f_{max}}{f_{min}}\right)^{\frac{1}{y-x}},$$

and

$$K = f_{max}\left(\frac{f_{max}}{f_{min}}\right)^{\frac{x}{y-x}}.$$


Now the marking probability algorithm is as follows:

– Packets traversing link $l$ are marked with probability $(1 - e^{-(\delta_l(k))^n})$.

– Using this marking scheme the rate of unmarked packets received by Virtual
Session $v$ is:

$$M_v(k) = \exp\Big(-\sum_{l \in L_v} (\delta_l(k))^n\Big).$$

– The approximate value of $\min_{l \in L_v} f_l(k)$ is calculated as

$$\min_{l \in L_v} f_l(k) \approx \frac{K}{b^{\delta_{v\max}}},$$

where

$$\delta_{v\max} = [-\ln M_v(k)]^{1/n}.$$

– The transmission rate for each virtual session $v$ in the network is given by

$$x_v(k+1) = \Big[(1-\beta)\,x_v(k) + \beta\, w_v \min_{l \in L_v} f_l(k+1)\Big]_{f_{min}}^{f_{max}},$$

where $\beta \in (0, 1)$ is a damping parameter.
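
The transformed scheme can be sketched as follows. This is an idealized illustration under assumptions: the helper names are hypothetical, b and K are obtained by requiring the transformed value to equal y at the smallest fairshare and x at the largest, and the unmarked fraction is computed exactly instead of being measured from packets. The parameter values (fairshare range 0.3 to 30, n = 50, x = 0.95, y = 1.05) are the ones quoted for the Y-network simulations.

```python
import math


def scale_params(f_min, f_max, x=0.95, y=1.05):
    """Return (b, K) with log_b(K / f_min) = y and log_b(K / f_max) = x."""
    b = (f_max / f_min) ** (1.0 / (y - x))
    K = f_max * b ** x
    return b, K


def delta(f, b, K):
    """Transformed fairshare log_b(K / f), kept inside [x, y] by construction."""
    return math.log(K / f) / math.log(b)


def estimate_min_fairshare(fairshares, b, K, n):
    """What a receiver would infer from the exact unmarked fraction on its path."""
    unmarked = math.exp(-sum(delta(f, b, K) ** n for f in fairshares))
    d_max = (-math.log(unmarked)) ** (1.0 / n)
    return K / b ** d_max


b, K = scale_params(0.3, 30.0)
print(delta(30.0, b, K), delta(0.3, b, K))              # the two ends of [x, y]
print(estimate_min_fairshare([6.0], b, K, 50))          # single link: recovered exactly
print(estimate_min_fairshare([30.0, 20.0 / 3.0], b, K, 50))
```

On a single-link path the estimate reproduces the fairshare. On a two-link path the sum over links slightly overstates the maximum transformed value, so the estimate falls somewhat below the true minimum for finite n; the gap shrinks as n grows.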
3.3 Other Implementation Considerations

In this subsection, we discuss the amount of information required to implement 
our multicast congestion control algorithm. We claim that, in the context of
the IP multicast architecture, no additional per-multicast-flow information is
required beyond what is already needed to implement multicast routing and
layered multicast. It is important to note that layered multicast is typically
implemented within the IP architecture by having receivers join and leave multicast
groups; see the work on receiver-driven layered multicast by McCanne, Jacobson and
Vetterli HH and the more recent work of Rubenstein, Kurose, and Towsley Hg.
Consequently, it is sufficient for the routers to propagate the marks to the re- 
ceivers based on their aggregate loads. The receivers then compute the prices 
and determine what rates they should receive at. This in turn determines which 
layers they should listen to. They then may choose to add layers (by joining mul- 
ticast groups) or drop layers (by leaving multicast groups). We have described 
how no additional per flow state is required in the context of IP. We expect this 
to be the case for any future multicast architectures. 

As far as the congestion control algorithm goes, one may argue that the 
routing table contains information about the number of multicast virtual sessions 
passing through a router which could be used to compute the fairshare at each 
node. However, it is important to recognize that our algorithm is intended for 
networks with unicast and multicast flows and therefore, it is significant that we 
do not require knowledge of the number of unicast flows through the link. For
example, if a router handles 100,000 flows, out of which 10,000 are multicast
flows and the rest are unicast, it does not maintain per-flow information on
the 90,000 unicast flows. Thus, from a congestion control point of view, our
algorithm does not require any additional per-flow information. In addition to 
aggregate flow information, we only require one-bit packet marking to convey 
congestion information. 



4 Local Stability and Rate of Convergence 

In this section, we study the conditions under which the algorithm presented is 
locally asymptotically stable. We make the following assumptions: 

– For each Session $s$, there is a Virtual Session $v$ such that $x_v^* > x_r^*$ for all
$r \in V(s) \setminus \{v\}$, where $\{x_v^*\}$ denotes the max-min fair rates. In other words,
under the max-min fair rate allocation, we assume that each multicast session
has a unique virtual session which determines the overall transmission rate
of the multicast source.

— Under max-min fair rate allocation, each virtual session has a unique bot- 
tleneck link. 

— Each link is a bottleneck for at least one virtual session. 

We note that the above assumptions are required only for the convergence anal- 
ysis of this section but not for the implementation of the algorithm. 
Proposition 1. The algorithm presented in the previous section is locally
asymptotically stable under the following condition on $\epsilon$:

$$0 < \epsilon < \frac{2\lambda_l^*}{W_l} \quad \text{for each link } l,$$

where $W_l$ is the sum of the weights for the sessions bottlenecked at link $l$.

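Proof: Recall that

$$\lambda_l(k+1) = \lambda_l(k) - \epsilon\,\lambda_l(k)\,(c_l - l_l(k))$$

$$\qquad\qquad\; = \lambda_l(k) - \epsilon\,\lambda_l(k)\Big(c_l - \sum_{s \in S_l} \max_{v \in V_{sl}} x_v(k)\Big), \qquad (12)$$

$$x_v(k) = w_v \min_{m \in L_v} 1/\lambda_m(k), \qquad (13)$$

where $w_v$, the weight of the Virtual Session $v$, is the weight of the corresponding
multicast session to which it belongs.

Let $\lambda_l^*$ be the scaled shadow price (the inverse of the fairshare) associated with
link $l$ under weighted max-min fair rate allocation. Define

$$\tilde\lambda_l(k) := \lambda_l(k) - \lambda_l^*.$$

Let $\lambda(k)$ denote the vector of scaled shadow prices of the links. Then, (12) can be
written as

$$\lambda(k+1) = f(\lambda(k)),$$

where $f(\cdot)$ is a vector whose $l$-th element is the right-hand side of (12). To prove
that this system is locally stable, we linearize around $\lambda = \lambda^*$ and prove that
the eigenvalues of the resulting linear system lie within the unit circle in the
complex plane. Thus, the linear system is given by

$$\tilde\lambda(k+1) = A\,\tilde\lambda(k),$$

where

$$A = \left.\frac{\partial f}{\partial \lambda}\right|_{\lambda = \lambda^*} =
\begin{bmatrix}
1 - \dfrac{\epsilon W_{11}}{\lambda_1^*} & -\dfrac{\epsilon W_{12}\lambda_1^*}{\lambda_2^{*2}} & \cdots & -\dfrac{\epsilon W_{1L}\lambda_1^*}{\lambda_L^{*2}} \\
-\dfrac{\epsilon W_{21}\lambda_2^*}{\lambda_1^{*2}} & 1 - \dfrac{\epsilon W_{22}}{\lambda_2^*} & \cdots & -\dfrac{\epsilon W_{2L}\lambda_2^*}{\lambda_L^{*2}} \\
\vdots & \vdots & \ddots & \vdots \\
-\dfrac{\epsilon W_{L1}\lambda_L^*}{\lambda_1^{*2}} & -\dfrac{\epsilon W_{L2}\lambda_L^*}{\lambda_2^{*2}} & \cdots & 1 - \dfrac{\epsilon W_{LL}}{\lambda_L^*}
\end{bmatrix}.$$
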
Here $W_{lm}$ is the sum of the weights for those sessions passing through link $l$
that are bottlenecked at link $m$. $L$ is given as the total number of links in the
network.

From the definition of $W_{lm}$, it is easy to see that if $W_{ij} > 0$ then $W_{ji} = 0$.
Note that $W_{ij} > 0$ implies that some session passes through both links $i$ and $j$
and is bottlenecked at link $j$. This means that $\lambda_j^* > \lambda_i^*$. It is not simultaneously
possible to have another session passing through both links $i$ and $j$ resulting in a
bottleneck at link $i$, thus $W_{ji} = 0$. In view of this, without loss of generality, let
us label the links such that if a session passes through link $i$ and is bottlenecked at
link $j$, $i$ is larger than $j$. This produces a triangular matrix, $A$, and thus,
its eigenvalues are the main diagonal elements. Since the necessary and sufficient
condition for the stability of a linear system is that all the eigenvalues have
absolute value less than 1, the statement of the proposition follows immediately. □

Remark 2. The speed of convergence is determined by the spectral radius (the
maximum absolute value of the eigenvalues) of $A$. Since the eigenvalues of $A$ are
its diagonal elements, the spectral radius of each of the algorithms is given by

$$\max_l \left|1 - \frac{\epsilon W_l}{\lambda_l^*}\right|.$$ □
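
Because the linearized system is triangular, the stability check reduces to arithmetic on the diagonal. The sketch below is an illustration under a stated assumption: it takes the diagonal entry for link l to be 1 minus the stepsize times W_l divided by the equilibrium scaled shadow price, which is what differentiating a price update of the form "price plus stepsize times price times (load minus capacity)" gives when the load is a weighted sum of inverse prices. The function names and the sample values (links B and C of a Y-network with fairshares 6 2/3 and 2 1/2 and bottlenecked weights 3 and 4) are hypothetical inputs chosen for this sketch.

```python
def diagonal_entries(eps, lam_star, W):
    """Eigenvalues of the triangular linearization, one per link (assumed form)."""
    return {l: 1.0 - eps * W[l] / lam_star[l] for l in lam_star}


def spectral_radius(eps, lam_star, W):
    return max(abs(d) for d in diagonal_entries(eps, lam_star, W).values())


def step_bound(lam_star, W):
    """Largest stepsize for which every diagonal entry stays strictly inside (-1, 1)."""
    return min(2.0 * lam_star[l] / W[l] for l in lam_star)


lam_star = {"B": 0.15, "C": 0.4}   # inverse fairshares 1/6.667 and 1/2.5
W = {"B": 3.0, "C": 4.0}           # weights bottlenecked at each link

print(diagonal_entries(0.01, lam_star, W))
print(spectral_radius(0.01, lam_star, W), step_bound(lam_star, W))
```

A stepsize of 0.01 sits well inside the bound for this example, and the link with the eigenvalue closest to 1 in magnitude sets the asymptotic convergence rate.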



5 Simulation Results 

In this section, we study the algorithm presented in the previous sections through
simulations on the simple Y-network of Figure 1 and a more complicated network,
which we call the general network. We normalize time such that an update
interval for the fairshare at each link is one time slot. All link capacities should
then be interpreted as being measured in terms of 100 packets per time slot, i.e.,
if the capacity of a link is 10, then it should be interpreted as 1000 packets per
time slot. All packets are assumed to be of equal size and, due to the above
normalization, the actual size of the packet (in bytes) is irrelevant. The parameters
K and b defined in Section 3.2 are chosen such that $0.95 \le \delta_l \le 1.05$ for all links.

We assume that no packets are lost in the buffer. This is reasonable if we as- 
sume that active queue management (AQM) schemes (e.g., [||) with ECN packet 
marking (f]) are employed to nearly eliminate losses in the network. As in (^, 
this would mean that, for each link, a target utilization is chosen, say 0.98. Then, 
the marking algorithms would compare the arrival rate to 0.98 times the link 
capacity, as opposed to the full link capacity. Thus, early warning of congestion 
would be provided and the arrival rate at any link would almost always be less 
than the link capacity leading to very low levels of packet loss. These issues have 
been addressed extensively in |B| and therefore, are not considered here. 

The following figures and simulation data are representative of the many
simulations done to examine the algorithm presented. The simulation values were
chosen arbitrarily within the algorithm's specifications to demonstrate
max-min fair convergence.
5.1 Simple Y-Network

For each simulation the following values were used: $c_A = 30$, $c_B = 20$, $c_C = 10$,
$w_0 = 1$, $w_1 = 2$, $w_2 = 3$, $n = 50$. The minimum and maximum fairshares for
each flow were chosen to be 0.3 and 30, respectively. We chose the stepsize $\epsilon$ to
be 0.01 and $\beta = 0.1$.

Figures 4 and 5 show the convergence of the session rates and the link
fairshares, respectively, for the algorithm. For this simple network, the max-min
fair rate allocation can be exactly calculated as $x_{01} = 6\frac{2}{3}$, $x_{02} = 2\frac{1}{2}$,
$x_1 = 13\frac{1}{3}$, $x_2 = 7\frac{1}{2}$. Simulation of the algorithm shows convergence about the
values $x_{01} = 6.6669$, $x_{02} = 2.4997$, $x_1 = 13.3337$, $x_2 = 7.4992$, with a maximum
oscillation of ±5% due to the randomness in the marking process.

Figure 5 shows the convergence of the fairshare values for links A, B, and
C. At convergence, the fairshares for links B and C oscillate around
$1/\lambda_B = 7.9877$, $1/\lambda_C = 2.6650$. As link A does not contain a bottlenecked session, its
fairshare is bounded by the capacity of the link. After approximately 70
iterations, $1/\lambda_A$ converged to 30. On the other hand, both links B and C
contain a bottleneck session and therefore, their fairshare converges to a value
smaller than $c_l$. These values are similar to the expected convergence values
of $1/\lambda_A = 30$, $1/\lambda_B = 6\frac{2}{3}$, $1/\lambda_C = 2\frac{1}{2}$. The difference can be explained by the
chosen value for n. Larger values of n show expected convergence of the fairshare
values.




Fig. 4. Y-network: Session Rates vs Number of Iterations

Fig. 5. Y-network: Link Fairshares vs Number of Iterations
5.2 General Network

The second set of simulations was done using the network in Figure 6. The
19 links are identified by the numbers associated with them in the figure. This
network carries traffic from 11 multicast and unicast sessions. The 6 multicast
sessions have a total of 14 virtual sessions. Table 1 provides a list of the virtual
session routes and weights, and the link capacities.




Fig. 6. General Network. 



The other parameters were chosen to be $\epsilon = 0.01$, $\beta = 0.1$, and $n = 50$.
The minimum and maximum fairshares for each flow were chosen to be 0.1 and
14, respectively. To simulate delays (possibly time-varying) in the network and
asynchronous updates, we introduce a probability parameter p. The transmission
rate of each virtual session in the network is updated at each time instant with
probability p, and with probability 1 − p, a session continues to use its old rate.
In this subsection, we provide results for the case where p = 0.5.
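
The asynchronous update rule described above amounts to an independent coin flip per virtual session. A minimal sketch follows; the function name `async_update` and the sample rates are illustrative assumptions.

```python
import random


def async_update(old_rates, new_rates, p, rng):
    """Each session adopts its newly computed rate with probability p, else keeps the old one."""
    return {v: (new_rates[v] if rng.random() < p else old_rates[v]) for v in old_rates}


rng = random.Random(0)  # seeded only to make the sketch repeatable
old = {"v01": 6.0, "v02": 2.0, "v1": 12.0, "v2": 7.0}
new = {"v01": 6.5, "v02": 2.4, "v1": 13.0, "v2": 7.4}
print(async_update(old, new, 0.5, rng))
```

With p = 1 the scheme reduces to the synchronous algorithm, and with p = 0 no session ever changes its rate; p = 0.5 makes each session act on stale information about half the time.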

Figures 7 and 8 show the convergence for four of the virtual sessions and
four of the links in the network. Overall, session rates and fairshares converged
within 5% of the expected values, given in Table 1. Our results show that, for
sufficiently small $\epsilon$ and sufficiently large n, the algorithm converges to the
max-min fair rates.

6 Conclusions 

We have presented a simple, decentralized algorithm to achieve weighted max- 
min fairness in multirate, multicast networks. The algorithm is simple to imple- 
ment at the source and receivers as well as at the routers in the network since 
no per-flow information is necessary. An ECN-like marking mechanism can be
used to convey information from the network to the multicast receivers. The 
algorithm presented here can be easily generalized to the case where there are 
maximum and minimum rate constraints on the source transmission rates. 
Table 1. Virtual Session Routes, Weights, Link Capacities, Max-min Fair Session
Rates, and Link Fairshares

x_sr     L_sr         w_s    l     c_l    Session Rate   1/λ_l
x_01     1,3,8        1      1     14     0.8002         2.5613
x_02     1,9,15       1      2     9      1.1356         5.6035
x_1      1,4,10,15    3      3     6      3.5995         6
x_21     1,5,11,16    4      4     6      1.6618         5.3937
x_22     1,5,12,18    4      5     5      1.7136         0.5882
x_3      1,5,11,17    2      6     6      0.8134         6
x_41     2,6,12,18    2      7     6      3.5006         1.2832
x_42     2,7,14,19    2      8     4      1.4995         1.4132
x_5      2,7,13,18    3      9     4      1.5010         4
x_6      1,3,9,15     1      10    6      1.1356         5.3937
x_71     1,3,8        4      11    8      3.2010         8
x_72     1,5,11,16    4      12    7      1.6618         6.9546
x_73     1,5,11,17    4      13    1.5    1.6268         1.0628
x_81     2,6,12,18    1      14    5      1.7503         5
x_82     2,6,11,16    1      15    10     1.5570         10
x_83     2,7,14,19    1      16    5      0.7498         5
x_9      2,7,14,19    3      17    4      2.2493         4
x_101    1,4,10,15    2      18    9      2.3996         9
x_102    1,5,1,17     2      19    5      0.8134         5

The columns x_sr, L_sr, w_s and Session Rate are indexed by virtual session; the
columns l, c_l and 1/λ_l are indexed by link.


There are two open issues that we plan to address in the near future. One is 
a proof of global convergence of the algorithm presented here. The other issue, 
which is of practical importance, is the impact of discrete bandwidth layers on 
the performance of the scheme suggested here US!. 
Abstract. The Source Specific Multicast (SSM) service model and
protocol architecture have recently been proposed as an alternative to the
currently deployed Any Source Multicast (ASM) service. SSM attempts
to solve many of the deployment problems of ASM including protocol
complexity, inter-domain scalability, and security weaknesses. However,
the SSM protocol architecture is not radically different from that of ASM.
This has created opportunities for integrating it into the currently
deployed ASM infrastructure. In this paper, we first describe the ASM
and SSM service models and associated protocol architectures,
highlighting the relative merits and demerits of each. We then examine the
network infrastructure needed to support both of them. Our conclusion
is that integration is relatively straightforward in most cases; however,
there is one case, supporting ASM service over an SSM-only protocol
architecture, for which it is difficult to design elegant solutions for an
integrated SSM/ASM infrastructure.



1 Introduction 

The original IP multicast service model was developed with the goal of creating 
an interface similar to that of best-effort unicast traffic p. For transmitters, the 
goal was to provide scalable transmission by allowing sources to simply trans- 
mit without having to register with any group manager or having to perform 
connection setup or group management functions. The application programming 
interface was similar to that for UDP packet transmission — an application would 
simply have to open a socket to a destination and begin transmitting. For re- 
ceivers, the goal was to provide a way to join a group and then receive all packets 
sent by all transmitters to the group. In this service model, each multicast host 
group was identified by a class-D IP address so that an end-host could partici- 
pate in a multicast session without having to know about the identities of other 
participating end-hosts. This eventually led to a triumvirate of protocols to
build multicast trees and forward data along them: a tree construction protocol,
the most widely deployed of which is Protocol Independent Multicast-Sparse
Mode (PIM-SM); the multicast equivalent of the Border Gateway Protocol (BGP)
for advertising reverse paths towards sources, called the Multiprotocol
Border Gateway Protocol (MBGP); and a protocol for disseminating information
about sources, called the Multicast Source Discovery Protocol (MSDP) [2].
In addition, the Internet Group Management Protocol (IGMP) was designed for
end-hosts to dynamically join and leave multicast groups.

The wide-scale commercial deployment of this service model and protocol 
architecture has run into significant barriers . Many of these barriers are rooted 
in the problem that building efficient multicast trees for dynamic groups of 
receivers is a non-trivial problem. As a result, the existing set of protocols is 
fairly complex and the learning curve is quite steep. Furthermore the “any- 
to-any” design philosophy of the current Any Source Multicast (ASM) service 
model is not suitable for commercial services. Most applications today need 
tighter control over who can transmit data to a set of receivers. 

By improving the efficiency and reducing the complexity of current
multicast protocols, the goal is to reduce the barriers to deployment. However,
developing and deploying yet another set of protocols creates the additional
burden of yet another round of modifications to the existing infrastructure. This may
itself become an impediment to deployment efforts. Therefore, it is important
to identify the technical problems that need to be solved before designing and
deploying a new protocol architecture. For the current ASM architecture and
service model, the problems include:

1. attacks against multicast groups by unauthorized transmitters 

2. deployment complexity 

3. problems of allocating scarce global class-D IP addresses 

4. lack of inter-domain scalability 

5. single point of failure problems 

A new service model, Source Specific Multicast (SSM), and an associated
protocol architecture have been proposed as a solution to the above problems 
and is beginning to be deployed 0. In the SSM service model a receiving host 
explicitly specifies the address of the source it wants to receive from, in addition 
to specifying a class-D multicast group address. 
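
The difference between the two kinds of join can be made concrete with a toy model. The sketch below is not a socket or IGMP API; the class `Subscription`, its fields, and the sample addresses are hypothetical and exist only to show which packets each kind of receiver would accept. An ASM-style subscription names a group only, while an SSM-style subscription names a source and a group.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Subscription:
    group: str
    source: Optional[str] = None  # None models an ASM join; a value models an SSM join

    def __post_init__(self):
        # The group must be a class-D address (224.0.0.0 through 239.255.255.255 for IPv4).
        if not ipaddress.ip_address(self.group).is_multicast:
            raise ValueError("group must be a class-D (multicast) address")

    def accepts(self, packet_source, packet_group):
        """Would a receiver holding this subscription be delivered the packet?"""
        if packet_group != self.group:
            return False
        return self.source is None or packet_source == self.source


asm = Subscription("239.1.2.3")
ssm = Subscription("239.1.2.3", source="192.0.2.10")
for sender in ("192.0.2.10", "192.0.2.99"):
    print(sender, asm.accepts(sender, "239.1.2.3"), ssm.accepts(sender, "239.1.2.3"))
```

The ASM-style receiver is delivered traffic from both senders, including one it never asked for, while the SSM-style receiver is delivered traffic only from the source it named.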

From a deployment standpoint, the fundamental advantage of SSM is that 
protocol complexity can be removed from the network layer and implemented 
more easily, simply, and cheaply at the application layer. The tradeoff, which it- 
self has both advantages and disadvantages, has the potential to fundamentally 
change how IP multicast service is provided in the Internet. Moreover there is 
significant overlap in the protocol architectures for ASM and SSM, thereby fa- 
cilitating the rapid integration of SSM support in networks that already support 
ASM. 

The key difference between the two service models lies in the way a receiving 
host joins a multicast group. As a result, there are a number of questions that 
arise about supporting the two. Should the two models exist simultaneously? 
Should the existence of two models be made visible to the user? What part of the
multicast infrastructure should be responsible for dealing with interoperability 
between the two service models? Answering these questions is a critical step in 
providing seamless interoperability between ASM and SSM. 

In this paper, we describe the differences between the ASM and SSM protocol 
architectures and service models. We then study the challenges of deploying 
an integrated ASM/SSM infrastructure. We believe that SSM can solve many 
of the technical problems without making the existing infrastructure obsolete. 
But, technical challenges exist in seamlessly integrating the two without creating 
“black holes”. Fortunately, we find that in most cases, the problems faced in 
integrating the two architectures are neither many nor insurmountable; they 
simply need to be identified and then the appropriate solutions implemented. 

The remainder of the paper is organized as follows. In Section 2 we describe
the multicast service models. Section 3 describes the ASM and SSM protocol
architectures. Section 4 discusses the challenges in integrating ASM and SSM.
The paper is concluded in Section 5.

2 IP Multicast Service Models 

In order to understand the implications of offering different types of IP multicast
services, we first need to distinguish between a protocol architecture and a
service model. A multicast protocol architecture refers to a set of protocols that
together allow end-hosts to join and leave multicast sessions, and allow routers to
communicate with each other to build inter-domain forwarding trees and to
forward data along them. An IP multicast service model refers to the semantics of
the multicast service that a network provides to an end-user. It is embodied in the
set of capabilities available to an end-user at the application interface level, and
is supported by a network protocol architecture. Any multicast service model is
realized through:

— an application programming interface (API) used by applications to com- 
municate with the host operating system. 

— host operating system support for the API. 

— protocol(s) used by the host operating system to communicate with the leaf 
network routers (referred to as designated routers or edge-routers). 

— protocol(s) for building inter-domain multicast trees and for forwarding data 
along these trees. 

With multiple service models and protocol architectures, the challenge there- 
fore lies in bridging the gap between the protocol architecture deployed in the 
network and the service model expected by end-user applications. 

Currently, there are two main IP multicast service models, plus a third derived
from a combination of the two. The details of the protocol architecture
supporting each of them are described in Section 3.

(a) Any-Source Multicast (ASM): This is the traditional IP multicast ser- 
vice model defined in RFC 1112p. An IP datagram is transmitted to a 
“host group”, a set of zero or more hosts identified by a single IP destination
address (224.0.0.0 through 239.255.255.255 for IPv4). This model supports
one-to-many and many-to-many multicast communication. Hosts may
join and leave the group at any time. There is no restriction on the location
or number of receivers, and a source need not be a member of the host
group it transmits to. Host-to-network communication support for ASM is
provided by the Internet Group Management Protocol (IGMP) version 2.
IGMPv2 allows a receiver to specify a class-D group address for the host
group it wants to join, but does not allow it to specify the sources that it
wants (or does not want) to receive traffic from. This service model is shown
in Figure 1(a).

Fig. 1. Three choices for the IP multicast service model.

(b) Source-Specific Multicast (SSM): This is the multicast service model
defined in [5]. An IP datagram is transmitted by a source S to an SSM address
G, and receivers can receive this datagram by subscribing to channel (S,G).
SSM is derived from EXPRESS jOj and supports one-to-many multicast. The
address range 232/8 has been assigned by IANA [3] for SSM service in IPv4.
In IPv6, an address range (FF2x:: and FF3x::) already exists|S| for SSM
services. IGMP version 3, which allows a receiver to specify explicitly the
source address, provides host-to-network communication support for SSM.
This requires upgrading most host operating systems and edge routers from
IGMPv2 to IGMPv3. This also implies that the host operating system’s
API must now allow applications to specify a source and a group in order
to receive multicast traffic. This service model is shown in Figure 1(b).

A variant of the ASM service model is known as the Source-Filtered Multicast
(SFM) model. In this case, a source transmits IP datagrams to a host
group address in the range of 224.0.0.0 to 239.255.255.255. However, each
application can now request data sent to a host group G from only a specific set
of sources, or can request data sent to host group G from all except a specific
set of sources. In other words, applications can apply “source filtering” to the
multicast data being transmitted to a given host group. Host-to-network
support for source filtering is provided by IGMPv3 for IPv4, and version 2 of
the Multicast Listener Discovery (MLD) protocol for IPv6|^.
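
The address-range distinction between the service models can be illustrated with a short Python sketch. It is only an illustration of the IPv4 ranges quoted above (224/4 for multicast in general, 232/8 for SSM); the function name `service_model_for_group` is a hypothetical helper, not part of any protocol specification or operating system API.

```python
import ipaddress

# Ranges taken from the text: 224/4 is the IPv4 multicast (class-D) space,
# and 232/8 is the block assigned to SSM.
IPV4_MULTICAST = ipaddress.ip_network("224.0.0.0/4")
IPV4_SSM = ipaddress.ip_network("232.0.0.0/8")


def service_model_for_group(group: str) -> str:
    """Return "SSM" or "ASM" for an IPv4 multicast group (hypothetical helper)."""
    addr = ipaddress.ip_address(group)
    if addr.version != 4 or addr not in IPV4_MULTICAST:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    # Outside 232/8 the traditional any-source semantics apply
    # (optionally refined by source filtering, i.e. SFM).
    return "SSM" if addr in IPV4_SSM else "ASM"


if __name__ == "__main__":
    for g in ("224.2.127.254", "232.1.2.3", "239.255.255.255"):
        print(g, service_model_for_group(g))
```

A host stack or application could use a check of this kind to decide whether a join must carry a source address.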

3 IP Multicast Protocol Architectures 

In this section we describe in detail the protocol architectures for supporting the 
ASM and SSM service models, and the relative merits and demerits of each. 

3.1 ASM Protocol Architecture 

The current inter-domain multicast architecture is based on the ASM service 
model. To become a member of a particular group, end-hosts register their mem- 
bership with querier routers handling multicast group membership functionality 
using the IGMP version 2 (IGMPv2) protocol|in] for IPv4 or the MLD version
1 (MLDv1) protocol P5 for IPv6. With IGMPv2 and MLDv1, source-filtering
capabilities are not available to receivers. 

Multicast-capable routers then construct a distribution tree by exchanging 
messages with each other according to a routing protocol. A number of different 
protocols exist for building multicast forwarding trees. These protocols differ 
mainly in the type of delivery tree constructed [11,12,13,14,15]. Of these, the
Protocol Independent Multicast Sparse-Mode (PIM-SM) protocol [14] is the most
widely deployed in today’s public networks. PIM-SM, by default, constructs a 
single spanning tree rooted at a core Rendezvous Point (RP) for all group mem- 
bers within a domain. Local sources then send their data to this RP which 
forwards the data down the shared tree to interested local receivers. A receiver 
joining a host group can only specify interest in the entire group and there- 
fore will receive data from any source sending to this group. Distribution via a 
shared tree can be effective for certain types of traffic, e.g., where the number 
of sources is large since forwarding on the shared tree is performed via a single 
multicast forwarding entry. However, there are many cases (e.g., Internet broad- 
cast streams) where forwarding from a source to a receiver is more efficient via 
the shortest path. PIM-SM also allows a designated router serving a particular 
subnet to switch to a source-based shortest path tree for a given source once the 
source’s address is learned from data arriving on the shared tree. This capability 
provides for distribution of data from local sources to local receivers using a 
common RP inside a given PIM domain. 

It is also possible for RPs to learn about sources in other PIM domains
by using the Multicast Source Discovery Protocol (MSDP)[IS|. Once an active
remote source is identified, an RP can join the shortest path tree to that source
and obtain data to forward down the local shared tree on behalf of interested 
local receivers. Designated routers for particular subnets can again switch to 
a source-based shortest path tree for a given remote source once the source’s 
address is learned from data arriving on the shared tree. 

The IGMPv2/PIM-SM/MSDP-based inter-domain multicast architecture sup- 
porting ASM has been deployed in IPv4 networks. It has been particularly ef- 
fective for groups where sources are not known in advance; when sources come 
and go dynamically; or when forwarding on a common shared tree is found to 
be operationally beneficial. 

However, there are several problems hindering the commercial deployment 
of these protocols. Some of these are inherent in the service model itself, while 
others are due to the complexity of the protocol architecture: 

— Attacks by unauthorized transmitters: In the ASM service model, a 
receiver cannot specify which specific sources it would like to receive data 
from when it joins a given group. A receiver is forwarded data sent by all 
group sources. This lack of access control can be exploited by malicious 
transmitters to disrupt data transmission from authorized transmitters. 

— Deployment complexity: The ASM protocol architecture is complex and 
difficult to manage and debug. Most of the complexity arises from the RP- 
based infrastructure needed to support shared trees, and from the MSDP 
protocol used to discover sources across multiple domains. These challenges 
often make network operators reluctant to enable IP multicast capabilities in 
their networks, even though most of today’s routers support the IGMP/PIM- 
SM/MSDP protocol suite. 

— Address allocation: This is one of the biggest challenges in deploying an 
inter-domain multicast infrastructure supporting ASM. The current multi- 
cast architecture does not provide an adequate solution to prevent address 
collisions among multiple applications. As a result two entirely different mul- 
ticast sessions may pick the same class-D address for their multicast groups 
and interfere with each other’s transmission. The problem is more serious for 
IPv4 than IPv6 since the total number of multicast addresses is smaller. A 
static address allocation scheme, GLOP ini, has been proposed as an interim 
solution for IPv4. GLOP addresses are allocated per registered Autonomous 
System (AS). However, the number of addresses per AS is inadequate when 
the number of sessions exceeds an AS’s allocation. Proposed longer-term so- 
lutions such as the Multicast Address Allocation Architecture (MAAA)JIHj 
are generally perceived as being too complex (with respect to the dynamic 
nature of multicast address allocation) for widespread deployment. Another 
long term solution, the unicast-prefix-based multicast architecture of IPv6 Pj 
expands on the GLOP approach; simplifies the multicast address allocation 
solution; and incorporates support for source-specific multicast addresses. 

— Inter-domain scalability: MSDP has always been regarded as an inelegant
solution. The protocol has weaknesses in terms of security and scalability.
For security, it is susceptible to denial-of-service attacks by domains sending
out a flood of source announcements. For scalability, MSDP is not well
designed to handle large numbers of sources. The primary reason is that the
source announcements were designed to be periodically flooded throughout
the topology and to carry data. As the number of sources in the Internet
increases, MSDP will generate greater amounts of control traffic.

— Single point of failure: When multicast data distribution takes place over 
a shared tree via a core network node (RP in the case of PIM-SM), failure of 
the core can lead to complete breakdown of multicast communication^. In the 
ASM protocol architecture, a receiver is always grafted on to an RP-based 
shared tree when it first joins a multicast group. This reliance on the shared- 
tree infrastructure makes the ASM protocol architecture fundamentally less 
robust. 
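
To make the GLOP limitation mentioned under address allocation concrete, the following Python sketch computes the block an AS would receive. It assumes the mapping given in the GLOP specification (RFC 2770, later RFC 3180), in which a 16-bit AS number is placed in the middle two octets of a 233/8 address; that mapping is not spelled out in this paper, and the function name `glop_block` is hypothetical.

```python
import ipaddress


def glop_block(as_number: int) -> ipaddress.IPv4Network:
    """Return the 233.x.y.0/24 GLOP block for a 16-bit AS number.

    Assumption: the AS number occupies the second and third octets,
    as described in the GLOP RFCs.
    """
    if not 0 <= as_number <= 0xFFFF:
        raise ValueError("GLOP is defined only for 16-bit AS numbers")
    high, low = divmod(as_number, 256)
    return ipaddress.ip_network(f"233.{high}.{low}.0/24")


if __name__ == "__main__":
    block = glop_block(5662)
    # Each AS receives a single /24, i.e. 256 group addresses.
    print(block, block.num_addresses)
```

Under this assumption an AS has only 256 statically assigned group addresses, which is why the scheme falls short once the number of concurrent sessions exceeds that allocation.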



3.2 SSM Protocol Architecture 

As mentioned before, Source-Specific Multicast (SSM) defines a service model
for a “channel” identified by an (S,G) pair, where S is a source address and G is
an SSM address. This model can be realized by a protocol architecture where
packet forwarding is restricted to shortest path trees rooted at specific sources,
and channel subscriptions are described using a group management protocol
such as IGMPv3 or MLDv2.

The SSM service model alleviates all of the deployment problems described 
earlier: 

— The distribution tree for an SSM channel (S,G) is always rooted at the 
source S. Thus there is no need for a shared tree infrastructure. In terms 
of the IGMPv2/PIM-SM/MSDP architecture, this implies that neither the 
RP-based shared tree infrastructure of PIM-SM nor the MSDP protocol is 
required. Hence the protocol architecture for SSM is significantly less com- 
plex than that for ASM, making it easy to deploy. In addition, SSM is not 
vulnerable to RP failures or denial-of-service attacks on RP(s). 

— SSM provides an elegant solution to the access control problem. Only a 
single source S can transmit to a channel (S,G) where G is an SSM address. 
This makes it significantly more difficult to spam an SSM channel than 
an ASM host group. In addition, data from unrequested sources need not 
be forwarded by the network, which prevents unnecessary consumption of 
network resources uni. 

— SSM defines channels on a per-source basis; hence SSM addresses are “lo- 
cal” to each source. This averts the problem of global allocation of SSM 
addresses, and makes each source independently responsible for resolving 
address collisions for the various channels that it creates. 

— It is widely held that point-to-multipoint applications such as Internet TV
will dominate the Internet multicast application space in the near future. The
SSM model is ideally suited for such applications. Thus the deployment of
SSM will provide tremendous impetus to inter-domain Internet multicasting
and will pave the way for a more general multipoint-to-multipoint service in
the future.

^ Multiple cores can certainly be used to alleviate this problem, but redundancy comes
at the price of extra overhead and complexity.

A protocol architecture for SSM requires the following: 

— Source specific host membership reports: The host-to-network protocol 
must allow a host to describe specific sources from which it would like to 
receive data. 

— Shortest path forwarding: DRs must be capable of recognizing
receiver-initiated, source-specific host reports and initiating (S,G) joins directly to
the source.

— Elimination of shared tree forwarding: In order to achieve global
effectiveness of SSM, all networks must agree to restrict data forwarding to
source trees (i.e., prevent shared tree forwarding) for SSM addresses. The
address range 232/8 has been allocated by IANA for deploying source-specific
IPv4 multicast (SSM) services. In this range, SSM is the sole service model.
For IPv6, a source-specific multicast address range has been defined |H], as a
special case of unicast prefix-based multicast addresses.

We now discuss the framework elements in detail: 

— Channel discovery: In the case of ASM, receivers need to know only the 
group address for a specific session. In the IGMPv2/PIM-SM/MSDP archi- 
tecture, designated routers discover an active source via the RP infrastruc- 
ture and MSDP, and then graft themselves to the multicast forwarding tree 
rooted at that source. In the case of SSM, an application on an end-host must 
know both the SSM address G and the source address S before subscribing 
to a channel. Thus the function of channel discovery becomes the
responsibility of applications. This information can be made available in a number
of ways, including via web pages, session announcement applications, etc.

— SSM-aware applications: The advertisement for an SSM session must 
include a source address as well as a group address. Also, applications
subscribing to an SSM channel must be capable of specifying a source address
in addition to a group address. In other words, applications must be
SSM-aware. Specific API requirements are identified in 123 .

— Address Allocation: For IPv4, the address range 232/8 has been
assigned by IANA for SSM. Sessions expecting SSM functionality must
allocate addresses from the 232/8 range. To ensure global SSM functionality in
232/8, including in networks where edge routers run IGMPv2 (i.e., do not
support source filtering), operational policies are being proposed^ which
prevent data sent to 232/8 from being delivered via shared trees.

Note that it is possible to achieve the benefit of direct and immediate (S,G)
joins in response to IGMPv3 reports in ranges other than 232/8. However,
non-SSM address ranges allow for concurrent use of both the ASM and SSM
service models. Therefore, while we can achieve the PIM join efficiency in
the non-SSM address range with IGMPv3, it is not possible to prevent the
creation of shared trees or shared tree data delivery there. Consequently, these
ranges cannot provide certain types of access control, nor can a source assume
unrestricted per-source address use as it can within the SSM address range.

In the case of IPv6, |H| has defined an extension to the addressing architecture
to allow for unicast prefix-based multicast addresses. In this case, bytes 0-3
(starting from the least significant byte) of the IP address are used to specify
a multicast group id, bytes 4-11 are used to specify a unicast address
prefix (of up to 64 bits) that owns this multicast group id, and byte 12 is
used to specify the length of the prefix. A source-specific multicast address
can be specified by setting both the prefix length field and the prefix field
to zero. Thus IPv6 allows for 2^32 SSM addresses per scope for every source
(the 32-bit group id), while IPv4 allows 2^24 addresses per source (the 24
bits left free within 232/8).

— Host-to-network communication: The currently deployed version of 
IGMP (IGMPv2) allows end-hosts to register their interest in a multicast 
group by specifying a class-D IP address for IPv4. However in order to im- 
plement the SSM service model, an end-host must specify a source’s unicast 
address as well as an SSM address. This capability is provided by IGMP 
version 3 (IGMPv3). IGMPv3 supports “source filtering”, i.e., the ability 
of an end-system to express interest in receiving data packets from only 
a set of specific sources, or from all except a set of specific sources. Thus 
IGMPv3 provides a superset of the capabilities required to realize the SSM 
model. Hence an upgrade from IGMPv2 to IGMPv3 is an essential change 
for implementing SSM. 

IGMPv3 requires the API to provide the following operation (or its logical
equivalent) PH:

IPMulticastListen(Socket, IF, G, filter-mode, source-list)

As explained in the IGMPv3 specification PJ, the above IPMulticastListen()
operation subsumes the group-specific join and leave operations of IGMPv2.
Performing (S,G)-specific joins and leaves is also trivial. A join operation is
equivalent to:

IPMulticastListen(Socket, IF, G, INCLUDE, S)

and a leave operation is equivalent to:

IPMulticastListen(Socket, IF, G, EXCLUDE, S)

There are a number of backward compatibility issues between IGMP versions 
2 and 3 which have to be addressed. There are also some additional require- 
ments for using IGMPv3 for the SSM address range. A detailed discussion 
of these issues is provided in [22| . 

The Multicast Listener Discovery (MLD) protocol is used by an IPv6 router 
to discover the presence of multicast listeners on its directly attached links, 
and to discover the multicast addresses that are of interest to those
neighboring nodes. Version 1 of MLD HH is derived from IGMPv2 and allows a
multicast listener to specify the multicast group(s) that it is interested in.
Version 2 of MLD0 is derived from, and provides the same support for
source-filtering as, IGMPv3.

— PIM-SM modifications for SSM: PIM-SM [14] itself supports two types
of trees, a shared tree rooted at a core (RP), and a source-based shortest 
path tree. Thus PIM-SM already supports source-based trees; however, PIM- 
SM is not designed to allow a router to choose between a shared tree and a 
source-based tree. In fact, a receiver always joins a PIM shared tree to start 
with, and may later be switched to a per-source tree by its adjacent edge 
router. 

A key to implementing SSM is to eliminate the need for starting with a shared 
tree and then switching to a source-specific tree. This involves several changes 
to PIM-SM as described in M- The resulting PIM functionality is referred 
to as PIM-SSM. The most important changes to PIM-SM with respect to 
SSM are as follows: 

• When a DR receives an (S,G) join request with the address G in the SSM
address range, it must initiate an (S,G) join and never a (*,G) join.

• Backbone routers (i.e. routers that do not have directly attached hosts) 
must be capable of receiving (S,G) joins and forwarding them based on 
correct RPF information. In addition, they must not propagate (*,G) 
joins for group addresses in the SSM address range. 

• Rendezvous Points (RPs) must not accept PIM Register messages or 
(*,G) join messages. 
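
The PIM-SSM rules listed above for a designated router can be sketched as a small decision function. This is a simplified illustration under stated assumptions, not router code: the function name `dr_upstream_join` is hypothetical, the SSM range is taken to be 232/8 only, and a host report without a source stands for an IGMPv2-style (*,G) request.

```python
import ipaddress
from typing import Optional, Tuple

IPV4_SSM = ipaddress.ip_network("232.0.0.0/8")  # SSM range assumed to be 232/8


def dr_upstream_join(group: str, source: Optional[str] = None) -> Optional[Tuple[str, str]]:
    """Return the PIM join a DR would send upstream, or None if no join is sent.

    (S, G) is returned for source-specific joins and ("*", G) for shared-tree joins.
    """
    in_ssm_range = ipaddress.ip_address(group) in IPV4_SSM
    if source is not None:
        # Source-specific report: join the shortest path tree rooted at S.
        return (source, group)
    if in_ssm_range:
        # A (*,G) join must never be initiated for an SSM address.
        return None
    # Outside the SSM range the usual shared-tree behaviour applies.
    return ("*", group)


if __name__ == "__main__":
    print(dr_upstream_join("232.1.1.1", "192.0.2.1"))
    print(dr_upstream_join("232.1.1.1"))
    print(dr_upstream_join("224.2.0.1"))
```

The `None` branch is where a black hole can arise: the request is dropped and, without further mechanisms, the host receives no indication.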

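The source-filtering semantics behind the IPMulticastListen operation can likewise be modelled in a few lines of Python. The sketch below is an assumption-laden illustration: the Socket and IF arguments are omitted, the class `ListenState` and its methods are hypothetical names, and an any-source (IGMPv2-style) join is represented as EXCLUDE with an empty source list, which is how the IGMPv3 specification expresses it.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, FrozenSet, Iterable, Tuple


class FilterMode(Enum):
    INCLUDE = "INCLUDE"  # accept only the listed sources
    EXCLUDE = "EXCLUDE"  # accept every source except the listed ones


@dataclass
class ListenState:
    """Per-socket reception state, keyed by group address (hypothetical model)."""
    filters: Dict[str, Tuple[FilterMode, FrozenSet[str]]] = field(default_factory=dict)

    def ip_multicast_listen(self, group: str, mode: FilterMode, sources: Iterable[str]) -> None:
        # Each call replaces the previous filter state for the group.
        self.filters[group] = (mode, frozenset(sources))

    def accepts(self, group: str, source: str) -> bool:
        if group not in self.filters:
            return False
        mode, sources = self.filters[group]
        if mode is FilterMode.INCLUDE:
            return source in sources
        return source not in sources


if __name__ == "__main__":
    state = ListenState()
    # Subscribe to channel (S,G): INCLUDE mode with a single source.
    state.ip_multicast_listen("232.1.2.3", FilterMode.INCLUDE, ["192.0.2.10"])
    print(state.accepts("232.1.2.3", "192.0.2.10"))
    print(state.accepts("232.1.2.3", "198.51.100.7"))
```

Modelling the state as a (mode, source set) pair per group shows why IGMPv3 provides a superset of what SSM needs: SSM uses only the INCLUDE mode.
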
In summary, the ASM service model and protocol architecture suffer from 
a number of serious deployment problem 0. The SSM service model addresses 
many of the needs of today’s commercial multicast applications. Also, the
associated protocol architecture is simpler and easy to deploy in networks that
already support ASM. The challenge then becomes integrating the two.

4 Integrating ASM and SSM 

In this section, we examine interoperability issues between ASM and SSM. From 
our discussion so far, it is clear that there is significant overlap between the two 
protocol architectures. Therefore, it is possible to integrate the two. The task 
is to then investigate the interoperability issue from a host perspective, i.e. if 
a host is connected to a network, what does it have to do to properly utilize 
whatever multicast service is present? 

Given that the two service models and two protocol architectures form a set 
of four combinations, the challenge is to identify any problems in providing a 
seamless multicast service — including both intra- and inter-domain operation. 
As we have discovered, most of these scenarios are trivially workable. Of those 
that remain, one requires minor changes to existing protocols, and one is quite 
challenging. Our goal is to (1) identify what the interoperability problems are, (2) 
identify solutions to these problems, and (3) understand the relative complexities 
of deploying these solutions. 

In the next section we describe the four combinations of service models and
protocol architectures. Following the overview, we focus specifically on solutions 
for the most difficult case. 

4.1 Service Model and Protocol Architecture Combinations 

There are four combinations of service models and protocol architectures. These
are shown in Figure 2. The key challenge naturally occurs because the host does
not know how the network is configured. This is actually a reasonable
abstraction. The host should simply join a multicast group. If IGMPv3 is available and
the application knows who the source is, this information should be passed to
the network. If this information is not available or if IGMPv3 is not supported,
the network should still respond in a predictable manner.

The challenge of deploying a multicast service is to provide correct operation 
for all kinds of multicast no matter if (1) the host is limited to only IGMPv2 
and/or (2) the network is limited only to SSM support. In fact, the ability to 
interoperate with ASM was one of the key requirements for SSM0. This re- 
quirement was critical because the development of any completely new protocol 
architecture would mean that multicast deployment efforts would have to start 
completely over. Furthermore, given the near infinite lifetime of legacy architec- 
tures, ASM would in all likelihood continue to exist and need to be supported. 
Therefore, a new protocol architecture that did not integrate with ASM would 
not reduce complexity but rather increase it. Therefore, SSM was designed to 
interoperate with ASM. However, because SSM implements a subset of ASM 
functionality, there needs to be additional work to properly integrate the two. 

The challenge with integrating ASM and SSM is how to handle the discon- 
tinuities between the two. If the two are not integrated properly, multicast does 
not work. Even more problematic is that there is no feedback from any part of 
the network or host that says multicast is not working. Cases where this kind
of behavior occurs are called “black holes”. Black holes occur when both the
network and the host are operating correctly, but no multicast flows because 
there is a disconnect between the service model and the protocol architecture. 
For example, the network only allows hosts to specify an explicit list of sources 
but the host sends a (*,G) join. The network cannot process this kind of message 
and ignores it. There is no feedback to the host that the join message was not 
properly handled. 

Before describing the specific black hole scenarios, we first describe the
scenarios that are more straightforward. Figure 2 shows all four combinations. They
are:



— ASM service model and ASM network: The upper-left scenario is the 
service model and protocol architecture that has been running in the Internet 
since 1997121. Theoretically, there are no black holes in this combination, 
though in practice, problems often occurPSE3- 

— SSM service model and ASM network: The upper-right scenario is
essentially the same protocol architecture that has been running since 1997,
but with IGMPv3 support. Through IGMPv3, users are given the ability to
specify a subset of all group sources, thereby refining the granularity of join
and leave messages beyond a single choice covering all sources. In the section
listing the service models, this combination is also called Source-Filtered
Multicast (SFM). SSM is supported in this combination in the address range
232/8|2S|. Theoretically, there are no black holes in this combination.

Fig. 2. The four combinations of service models and protocol architectures.

— ASM service model and SSM network: The lower-left scenario is the
most problematic of the four combinations. From the network point-of-view, 
the service provider has opted to only provide support for SSM. But for 
any of a variety of reasons, a host either chooses to send or is only capable 
of sending (*,G) join messages. The dilemma is how to solve this problem. 
Several possible solutions are discussed in the next section. 

— SSM service model and SSM network: The lower-right scenario is more 
straightforward than the previous case, but there is still one problem. Be- 
cause the multicast address space is divided into SSM and non-SSM ranges, 
the straightforward behavior is when the group address is in the SSM range 
(232/8)|2S|. Uncertainty occurs when handling (S,G) joins for the non-SSM 
range. There are two considerations to understand: 

• One of the dependencies here is whether the network is providing SSM 
support only in the 232/8 address range or whether it has been extended 
to cover the entire multicast address range (224/4). If the choice is only 
to support the 232/8 range, how should the network, host, and applica- 
tion handle (S,G) joins for addresses outside this range? The currently 
accepted practice seems to be to not allow SSM support outside of the 
232/8 range and embed these semantics in the operating system. 

• If the network provider instead chooses to provide SSM support for the 
entire address range, a problem is created for sources. Sources trans- 
mitting on a non-SSM address will not have their existence announced 
throughout the inter-domain infrastructure. To understand why this sit- 
uation occurs, consider the behavior in an ASM domain. When a source 
sends its first packet, the network encapsulates it and sends it to the RP. 
Since MSDP runs in the RP, an SA message is generated and flooded on 
the MSDP peering topology. Because the domain chooses to run SSM for 
the entire address space, there is no RP, no initial packet encapsulation, 
and no MSDP peer. Receivers in ASM domains will never discover the 
existence of this particular source. Again, the current accepted practice 
seems not to provide SSM-style service for addresses outside of the 232/8 
range. 

There are several ways to address the problem of an ASM service
model running in an SSM network. These solutions are discussed in the next
section.
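
The four combinations can be summarized in a small Python sketch. It is a deliberately simplified encoding of the classification given above, not a model of real router behaviour: the function name `join_outcome` and the outcome labels are hypothetical, the SSM range is assumed to be 232/8, and the practical problems noted for ASM networks are ignored.

```python
import ipaddress

IPV4_SSM = ipaddress.ip_network("232.0.0.0/8")  # SSM range assumed to be 232/8


def join_outcome(host_specifies_source: bool, network: str, group: str) -> str:
    """Classify a join attempt as "delivered", "black hole", or "uncertain".

    host_specifies_source: True for an (S,G) join, False for a (*,G) join.
    network: "ASM" or "SSM", the protocol architecture deployed in the network.
    """
    if network == "ASM":
        # Both upper scenarios: theoretically there are no black holes.
        return "delivered"
    if network != "SSM":
        raise ValueError("network must be 'ASM' or 'SSM'")
    if not host_specifies_source:
        # Lower-left scenario: the (*,G) join is ignored without feedback.
        return "black hole"
    # Lower-right scenario: clear-cut only inside the SSM range.
    in_ssm_range = ipaddress.ip_address(group) in IPV4_SSM
    return "delivered" if in_ssm_range else "uncertain"


if __name__ == "__main__":
    for args in [(False, "ASM", "224.2.0.1"), (True, "ASM", "232.1.1.1"),
                 (False, "SSM", "224.2.0.1"), (True, "SSM", "232.1.1.1"),
                 (True, "SSM", "224.2.0.1")]:
        print(args, "->", join_outcome(*args))
```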



4.2 Handling ASM Hosts in an SSM Infrastructure 

For hosts that can only speak IGMPv2, operation in an SSM-only network is 
difficult. Even for hosts that do speak IGMPv3 there are inter-domain consis- 
tency problems if SSM behavior is enforced beyond the 232/8 address range. 
The first step to solving these interoperability problems is to understand more 
clearly what the problems are. At a minimum, applications need deterministic, 
predictable behavior. Ideally, applications should be able to maintain some level 
of abstraction from the type of multicast service. However, implementing this
in the Internet looks to be quite difficult. The problem is that because there
are different semantics tied to the multicast address space (232/8 vs. the rest of 
224/4), different behavior is expected depending on the address used. Therefore, 
hosts need to have some awareness of these semantics and what the network 
supports. While a host does not need a complete understanding of what the 
network protocol architecture is, it needs to know whether its join messages are 
going to be processed properly. 

One of the fundamental problems is that a host sending a (*,G) join into 
an SSM network will have the join message ignored. Therefore, some additional 
action must be taken by the network if IGMPv2 hosts are to be supported. The 
first two solutions attempt to resolve the (*,G) into a set of sources. Two choices
of this type are shown in the top half of Figure 3 and described below.

— Run an MSDP peer: An SSM-only domain can run MSDP. Leaf routers
would query the MSDP cache for source information. This solution is shown
in Figure 3(a). A straightforward implementation based on existing protocols
would be to run an MSDP peer at the domain boundary and then use a
new protocol for communicating between the peer and the leaf routers. The
obvious disadvantage of this solution is that it appears to have almost as
much complexity as ASM. One saving is in not having to run an RP. A
second saving is being able to run the source discovery protocol at the
application layer and not embed it in the network layer. By implementing
an MSDP peer as an application and then adding some basic functionality
just to the leaf routers, this solution can make sense.

— URL Rendezvous Directory (URD): Similar to running an MSDP peer,
URD involves translating network-layer complexity into application-layer
management overhead. The idea is that a host will use the web to gather
information about the multicast group and will then initiate the group join
based on this information. This solution is shown in Figure 3(b). When a
user clicks on a link, the response is to send both the source and group
information back to the client as an HTTP re-direct to the URD port (port
659). For example, the returned link might look like the following:

http://content-source.com:659/source-address,group-address/

The router intercepts traffic sent to port 659 and, in combination with
the IGMPv2 join message sent by the user’s application, is
able to issue an (S,G) join to the source. The goal of URD is to put
all of the additional functionality necessary for SSM at the content source
site and in the leaf routers. This objective is achieved because no additional
modifications are required in the application or the host operating system.
The application responds normally to the HTTP re-direct and the operating
system issues an IGMPv2 (*,G) join. There is even support to avoid black
holes in the case where a content site uses URD but the leaf router does
not support it. In that case the router does not intercept the URL
re-direct, so it reaches the content source. If this re-direct arrives, the
content source knows the leaf router did not intercept it and can
inform the user via a web page that the join did not work. Of course, the
major disadvantage is that this requires routers to snoop on port 659 and
intercept traffic, which is not an acceptable requirement for most network
providers.

Fig. 3. Four possible solutions for ASM hosts in an SSM infrastructure.
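
As an illustration of the URD idea, the Python sketch below shows how an intercepting leaf router might extract the channel from a re-direct of the form shown above. It assumes that the path carries literal IP addresses as `source,group`; the exact URD syntax is not specified in this paper, and the function name `parse_urd_redirect` and the example host name are hypothetical.

```python
import ipaddress
from typing import Optional, Tuple
from urllib.parse import urlsplit

URD_PORT = 659  # port on which URD re-directs are intercepted, per the text


def parse_urd_redirect(url: str) -> Optional[Tuple[str, str]]:
    """Return (source, group) from a URD-style URL, or None if it does not match."""
    parts = urlsplit(url)
    if parts.port != URD_PORT:
        return None  # not addressed to the URD port, so not intercepted
    fields = parts.path.strip("/").split(",")
    if len(fields) != 2:
        return None
    try:
        source = ipaddress.ip_address(fields[0])
        group = ipaddress.ip_address(fields[1])
    except ValueError:
        return None
    if not group.is_multicast:
        return None
    return str(source), str(group)


if __name__ == "__main__":
    url = "http://content-source.example:659/192.0.2.10,232.1.2.3/"
    print(parse_urd_redirect(url))
```

With the (S,G) pair recovered this way, the router could match it against the IGMPv2 (*,G) report for the same group and issue the source-specific join on the host's behalf.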

Given that these two solutions require application-layer mechanisms to
replace the functionality of PIM RPs and MSDP, neither solution is particularly
elegant. A better solution would be to simply limit what the user can and
cannot do. For example, joins to non-232/8 group addresses could be
prohibited in SSM-only networks. Black holes would be avoided by letting the
user/application/host know that the unsupported join actions had failed. Details
on two particular solutions are shown at the bottom of Figure 3 and described
below.

— IGMPv3 with reject capability: The idea is to modify IGMPv3 to create
a more robust control path. The solution is to allow the leaf router to “reject”
an IGMP join. This solution is shown in Figure 3(c). A number of possible
reasons could exist for rejecting a join. The obvious case is when a (*,G) join
is sent for an address configured only to allow (S,G) joins. Another example
might be joins sent to a group that has been listed in a site-controlled reject
list. The IGMP reject message could include a return code providing a reason
for the rejection. This solution would require either an addition to IGMPv3
or a new version. While this solution seems quite reasonable, it requires
revising IGMPv3, which creates yet another deployment delay.

— Host discovery of network capability: Hosts could be given the capability
to discover for themselves how the network is configured. This solution
is shown in Figure 3(d). In this way, hosts could determine what group
addresses require a specific set of sources and what group addresses allow the
use of the wildcard (*) in join messages. This discovery provides the host with
enough information to reject an application’s join request. Like many of the
other solutions, the problem is that host behavior needs to be modified.

The last two solutions are the most reasonable, but the changes they
require imply another round of deployment. The practical solution is for all
routers, operating systems, and applications to assume that SSM runs only in the
232/8 range, and that some networks might only support SSM. Therefore, any (S,G)
join should be expected to work, but any (*,G) join should be suspect. The simple
policy should be that if an application has access to the set of one or more group
sources, it should use them. Otherwise, the possibility exists that (*,G) joins will
not be successful. This creates a certain amount of non-determinism but seems
easy to characterize: IGMPv2 joins might not work. The incentive is to upgrade
to IGMPv3 as quickly as possible.
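
The simple policy above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `choose_join` and the three outcome labels are hypothetical, and the treatment of a (*,G) join inside 232/8 as "not expected to work" is our reading of the assumption that the range is reserved for SSM, not a rule stated in the text.

```python
import ipaddress

# SSM range assumed by the policy in the text.
SSM_RANGE = ipaddress.ip_network("232.0.0.0/8")


def choose_join(group, sources=()):
    """Return (joins, expectation) for an application that wants `group`.

    Hypothetical helper: `joins` is a list of (kind, source, group) tuples.
    """
    addr = ipaddress.ip_address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a multicast group address")
    if sources:
        # The application knows its sources, so it should use them.
        return [("(S,G)", s, group) for s in sources], "expected to work"
    if addr in SSM_RANGE:
        # Assumption: a wildcard join in the SSM range is not serviced.
        return [("(*,G)", "*", group)], "not expected to work"
    # Outside 232/8 a wildcard join may fail in an SSM-only network.
    return [("(*,G)", "*", group)], "suspect"
```

For instance, `choose_join("232.1.1.1", ["192.0.2.10"])` yields a single (S,G) join, whereas `choose_join("224.2.0.1")` yields a (*,G) join flagged as suspect.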

5 Conclusions 

In this paper, we have considered the integration of the two service models for IP 
multicast: Any Source Multicast (ASM) and Source Specific Multicast (SSM). 
We have described the protocol architecture for each and discussed their
advantages and disadvantages. ASM is the traditional service model; however, it
suffers from a number of serious problems from a commercial deployment
standpoint. SSM solves most of these, which makes it suitable for rapid
deployment. The important advantage of SSM is the significant overlap of its
protocol architecture with that of ASM. We have explored the interoperability of
these two service models, and found that in most cases, the challenges are not
insurmountable. In the near term, we expect these two service models to co-exist
in a unified IP multicast infrastructure.

From a broader perspective, successfully integrating ASM and SSM should 
have a positive impact on the use and deployment of multicast. IP multicast has 
long suffered from the “chicken and egg” problem. The lack of popular applica- 
tions has given very little incentive to ISPs to enable multicast in their networks. 
Furthermore, ASM has been plagued by a number of deployment problems. On 
the other hand, the lack of widespread deployment has resulted in very limited 
interest in developing new applications. At the same time, the popularity of 
application-layer multicast has further slowed down deployment of IP multicast. 
It is hoped that SSM will spur the deployment and use of IP multicast by virtue of
its simplicity, ease of deployment, and its ability to be integrated into the existing
infrastructure. While SSM is ideally suited for point-to-multipoint applications,
multi-peer applications such as multi-party games can easily be supported by
building relays at the application level over an SSM-capable network. Such an
approach represents an attractive compromise between the efficiency of
network-level multicast and the ease of manageability of application-level
multicast.


Abstract. Multi-stage packet switches that feature a limited amount
of buffers in the switching fabric and distribute most of their buffer- 
ing capacity over the port cards have recently gained popularity due 
to their scalability properties and flexibility in supporting Quality-of- 
Service (QoS) guarantees. In such switches, the replication of multicast 
packets typically occurs at the outputs of the switching fabric. This ap- 
proach minimizes the amount of resources needed to sustain the internal 
expansion in traffic volume due to multicasting, but also exposes mul- 
ticast flows to head-of-line (HOL) blocking in the ingress port cards. 
Access regulation to the fabric buffers is of the utmost importance to 
safeguard the QoS of multicast flows against HOL blocking. 

We add minimal overhead to a well-known distributed scheduler for 
multi-stage packet switches to define the Generalized Distributed Multi- 
layered Scheduler (G-DMS), which achieves full support of QoS guaran- 
tees for both unicast and multicast flows. The novelty of the G-DMS is in 
the mechanism that regulates access to the fabric buffers, which combines 
selective backpressure with the capability of dropping copies of multicast 
packets that violate the negotiated profiles of the corresponding flows.



1 Introduction 

Many envisioned applications in the Internet, such as broadcast video, video- 
conferencing, multi-party telephony, and work-group applications, are multicast 
in nature and are expected to generate a significant portion of the total traffic. 
In spite of the growing importance of multicast traffic in the Internet, the
support of Quality-of-Service (QoS) guarantees for multicast flows is still far from
satisfactory in current-generation packet switches.

We consider the integration of multicast traffic in existing QoS frameworks
for unicast traffic in multi-stage switches [1,2,3,4], focusing in particular on the
Distributed Multilayered Scheduler (DMS) [3,4]. The DMS meets the throughput,
delay, and delay-jitter requirements of unicast flows by closely approximating an
ideal scheduling hierarchy in a distributed system. It applies to multi-stage
switches that include a layer of ingress port cards, a switching fabric with a
moderate amount of buffers, and a layer of egress port cards [6,7]. The fabric
may be implemented as a stand-alone shared-memory module, or expanded
to achieve higher aggregate capacity in a three-stage Memory/Space/Memory
(MSM) arrangement [4,6]. For simplicity of presentation, we restrict the scope
of this paper to the stand-alone implementation of the switching fabric.

The DMS associates a separate scheduling tree with each switch output. The 
schedulers that distribute service at the contention points of the switch (in the 
ingress port cards and fabric outputs, but not in the egress port cards, which 
we assume to be bufferless) constitute the nodes of the scheduling trees. The
presence of limited amounts of buffers in the fabric decouples scheduling trees 
that overlap at one or more nodes, and therefore enables the association of a 
distinct scheduling hierarchy with each switch output. In the fabric, buffers are 
statically partitioned in multiple queues, called QoS channels. Within a traffic 
class, the DMS establishes a QoS channel for each input-output pair. Selective 
backpressure regulates the admission of packets to the QoS channels: when
the amount of packets in a channel exceeds a given threshold, the assertion of 
backpressure forces further traffic destined for that channel to remain in the 
corresponding ingress port card. 

The generalization of the DMS for the integrated provision of QoS guarantees 
to unicast and multicast flows requires the adaptation of the QoS framework to 
the multicasting scheme of the underlying switch architecture. Our reference 
switch always replicates multicast packets according to a minimum multicast 
tree, that is, as far downstream as possible (and therefore never in the ingress 
port cards). A single instance of a multicast packet is stored in the fabric
memory, and replicated (if necessary) only when the packet is transmitted to 
one of its outputs. After receiving a multicast packet, the fabric generates a 
pointer for each output in its multicast distribution (or fanout), and links it to 
the corresponding output queue (a linked pointer virtually identifies a copy of 
the packet). 

The adoption of minimum multicast trees in a distributed switch with multi- 
ple layers of contention points, albeit optimal in the utilization of bandwidth and 
buffering resources, may compromise the provision of QoS guarantees for both 
unicast and multicast flows. In fact, head-of-line (HOL) blocking may occur if 
packets with different multicast distributions share portions of their forwarding 
paths before they are replicated (the forwarding path of a packet is identified 
by the sequence of queues that the packet traverses in the switch). In our
multi-stage switch, the problem arises in the ingress port cards if flows with different
multicast distributions are subject to the same backpressure indication. 

The complete isolation of the forwarding paths of packets with different 
fanouts does avoid HOL blocking, but is impractical for typical switch sizes, 
because it requires a separate queue for each multicast distribution that can be 
defined at each contention point in the switch (as an example, each ingress port
card of an N × N switch should contain 2^N - 1 queues).
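
The queue count quoted above is the number of non-empty subsets of the N switch outputs, since every such subset is a possible multicast distribution. The short Python sketch below, with hypothetical function names, checks the closed form against an explicit enumeration for a small switch and evaluates it for a larger one.

```python
def fanout_queue_count(n_outputs):
    """Number of distinct non-empty fanouts over n_outputs outputs."""
    return 2 ** n_outputs - 1


def enumerate_fanouts(n_outputs):
    """List every non-empty fanout as a sorted list of output indices.

    Bit j of the mask is set when output j belongs to the fanout.
    """
    return [
        [j for j in range(n_outputs) if (mask >> j) & 1]
        for mask in range(1, 2 ** n_outputs)
    ]


# Explicit enumeration agrees with the closed form for a 3 x 3 switch.
assert len(enumerate_fanouts(3)) == fanout_queue_count(3) == 7

# A 16 x 16 switch would already need 65535 queues per ingress port card.
print(fanout_queue_count(16))
```

The exponential growth is what makes complete isolation of the forwarding paths impractical for typical switch sizes.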

The overlay approach isolates unicast flows from multicast flows by
assigning separate forwarding paths to the two types of traffic (all multicast flows
share the same forwarding path). When applied to the DMS, the overlay
approach succeeds in preserving the QoS guarantees of unicast flows, but obviously
fails with multicast flows, because of the persistence of HOL blocking within their 
dedicated path. 

The Generalized DMS (G-DMS) that we present in this paper aggregates 
unicast and multicast traffic along common forwarding paths, factually han- 
dling unicast flows as single-legged multicast flows. HOL blocking still occurs for 
packets with different fanouts sharing the same queues, but no longer compro- 
mises the QoS guarantees of individual flows. One of the key elements of novelty 
of the G-DMS is in the mechanism that triggers the assertion of backpressure: 
when a packet of a multicast flow is replicated in the fabric, only one of its copies 
is counted for backpressure purposes. The QoS channel that is charged for the 
presence of the packet in the fabric is the primary QoS channel of the multicast 
flow (we refer to the channels that accommodate the other copies of the packet 
as the secondary QoS channels of the flow). The primary channel is the same 
for all packets of a given flow. Also, the primary channel of a flow can act as a 
secondary channel for a different flow. In the ingress port cards, multicast flows 
are subject to the backpressure indication coming from their respective primary 
channels. In particular, the same backpressure indication applies to multicast 
flows having different fanouts but identical primary channel. Per-flow traffic
policers in the ingress port cards and the fabric capability of selectively dropping
copies of multicast packets prevent secondary QoS channels from overflowing. In 
the port cards, the policers mark incoming packets that violate the negotiated 
profiles of the corresponding flows; then, the fabric drops all the copies of marked 
multicast packets that are destined for congested secondary channels. 

The G-DMS extends to multicast traffic all the QoS features of the DMS: 
throughput and delay guarantees for real-time flows that are compliant with 
their negotiated traffic profiles, throughput and fairness guarantees for flows 
with long-term bandwidth requirements, and fairness in the treatment of best- 
effort flows. The overhead induced by the generalization of the DMS is minimal 
(two counters for each queue in the fabric). 

The paper is organized as follows. In Section 2 we delineate the context of
application of the G-DMS by identifying the target QoS classes and switch
architecture, and describe the DMS in detail. In Section 3 we discuss the available
options for counting multicast packets in the switching fabric. We then overview
flow-control schemes based on pointer counters (Section 4) and on cell counters
(Section 5). We finalize the specification of the G-DMS in Section 6 and provide
concluding remarks in Section 7.



2 Background 

In this section we delineate the target QoS classes of the G-DMS and provide 
an overview of the underlying DMS. We also describe the algorithm that we 
use to enforce the coexistence of heterogeneous traffic components in the limited 
buffers of the switching fabric.

2.1 QoS Classes

The system that we address in this paper only handles fixed-size packets, which 
we call cells. The G-DMS applies to the case of variable-sized packets as well, 
but the lack of space forces us to defer the related discussion to an upcom- 
ing paper. We refer to network-wide end-to-end cell streams as flows, and fo- 
cus on the provision of differentiated QoS guarantees to three distinct classes 
of flows: Guaranteed-Delay (GD) class, Guaranteed-Bandwidth (GB) class, and
Best-Effort (BE) class.

GD flows have individually specified requirements in terms of throughput 
and transmission delay. In the ATM context, both the Constant Bit Rate
(CBR) and the real-time Variable Bit Rate (rt-VBR) service categories map onto
the GD class. 

GB flows have individual bandwidth requirements but no specified delay 
requirements (or their delay requirements are very loose). In ATM,
non-real-time Variable Bit Rate (nrt-VBR) virtual circuits (VCs) map onto the GB
class, together with Available Bit Rate (ABR) and Unspecified Bit Rate (UBR)
VCs with guaranteed Minimum Cell Rate (MCR).

Finally, BE flows have no negotiated bandwidth and delay guarantees. In 
ATM, the BE class includes ABR and UBR VCs with no specified MCR. In
the absence of explicitly specified QoS parameters, we aim at the achievement of 
fairness in the relative treatment of BE flows: the switch should always deliver 
the same amount of traffic for BE flows that remain continuously backlogged 
over a common forwarding path. 



2.2 The Distributed Multilayered Scheduler 

The Distributed Multilayered Scheduler applies to the three-stage switch
architecture of Fig. 1, which consists of a layer of ingress port cards with large buffers,
a switching fabric with small buffers concentrated in a single shared-memory 
module, and a layer of egress port cards with no buffers. 




[Figure: ingress stage (large buffers), switch fabric (small buffers), egress stage (no buffers)]

Fig. 1. Reference switch model.

The interfaces between port cards and incoming/outgoing links and between
port cards and fabric are synchronized to a common timing reference: the time 
axis is divided into timeslots of fixed duration, equal to the time needed to 
transfer a payload unit through any of the interfaces. According to the slotted 
nature of the time reference, we measure transmission rates in cells per timeslot 
(cpt) and transmission delays in timeslots. During a single timeslot, each ingress 
port card delivers no more than one cell to the fabric, and each fabric output 
transfers no more than one cell to the corresponding egress port card. 

The switching fabric has a relatively small amount of buffers and applies 
per-output selective backpressure to the ingress port cards to prevent buffer 
overflow when congestion is detected at one or more of its outputs. Most of the 
buffers are located in the ingress port cards. Selective backpressure makes it
possible to achieve nonblocking behavior and buffer utilization comparable with a
centralized shared-memory switch by making the buffers in the ingress port cards act
as an extension of the buffers in the fabric. 

Conforming to the DMS, the fabric supplies two distinct QoS channels per
input-output pair (i,j), for a total of 2N^2 QoS channels in the shared-memory
module. The Guaranteed-Delay (GD) channel conveys exclusively GD traffic,
whereas the Non-Guaranteed-Delay (NGD) channel aggregates GB and BE traffic.





[Figure: scheduler at fabric output j]

Fig. 2. The fabric-output scheduler.

At each of the N outputs of the fabric, a modular scheduler, consisting of
two prioritized components, distributes service to the 2N channels carrying
traffic from the N switch inputs (Fig. 2). The high-priority component, referred to
as the Guaranteed-Bandwidth Scheduler (GBS), instantiates the Shaped Virtual
Clock (Sh-VC) algorithm [11], a non-work-conserving worst-case-fair GPS-related
scheduler. The GBS satisfies exactly the minimum bandwidth
requirements of the QoS channels that have a non-null guaranteed service rate
(the guaranteed service rate of a channel is equal to the sum of the guaranteed 
service rates of its associated flows in the ingress port card). The low-priority 
component of the fabric-output scheduler, called the Excess Bandwidth Scheduler 
(EBS), is in charge of distributing the unused GBS bandwidth to all backlogged 
QoS channels, thus making the whole output scheduler work conserving. An in- 
stance of the Self-Clocked Fair Queueing (SCFQ) algorithm implements the 
EBS. The service-rate allocations for the QoS channels in the EBS are completely 
independent of the corresponding allocations in the GBS. A wide set of policies 
for the distribution of excess bandwidth can be emulated by properly assigning 
the EBS service rates Pj. Each QoS channel in the fabric asserts backpressure 
when its number of queued cells exceeds an associated static threshold. 

In the ingress port card, a two-level hierarchical scheduler regulates access
to the interface with the fabric (Fig. 3). At the higher level of the hierarchy, a
GBS-EBS pair arbitrates the distribution of service among the GD, GB, and
BE classes (no service rate is allocated to the BE class in the GBS). Distinct
schedulers are used at the lower level of the hierarchy for servicing the flows of
the three classes. In order to properly react to selective backpressure and avoid
HOL blocking, each per-flow scheduler provisions separate queues for each switch
output (virtual output queueing, or VOQ).




Fig. 3. The ingress-port-card scheduler.

A work-conserving worst-case-fair GPS-related scheduler such as Shaped
Starting Potential Fair Queueing (Sh-SPFQ) distributes bandwidth to each
of the N aggregates of GD traffic associated with the switch outputs (GD virtual
outputs). Within each GD virtual output, the individual flows get access to the
available bandwidth through a local instance of Sh-SPFQ.

A Weighted Round Robin (WRR) scheduler with a common frame reference
for the whole ingress port card distributes service to the GB flows according
to their respective bandwidth allocations. Proper reaction to the backpressure 
indication arriving from the fabric for the GB class is obtained by scheduling 
the backlogged GB flows through separate queues of flow pointers, each queue 
being associated with a distinct switch output. 

The per-flow scheduler for the BE class also instantiates the WRR algorithm, 
with all weights set to the same value in order to enforce fairness port-wide (for 
simplicity, we refer to this particular instance of the WRR paradigm as a Round 
Robin (RR) scheduler). GB and BE flows directed to the same switch output 
are subject to distinct backpressure indications. 

A common feature of a feasible implementation of the three per-flow sched- 
ulers that we have just summarized is the adoption of FIFO queues of flow 
pointers (or flow queues) in association with the respective virtual outputs. Sev- 
eral such queues are needed for the provision of delay guarantees to the GD 
flows even in the simplest implementations of the Sh-SPFQ algorithm, and
at least two queues per virtual output must be deployed for implementing a
WRR scheduler that applies a single frame reference to the whole ingress port
card.

We refer the reader to the DMS references cited in the Introduction for a complete overview of the criteria for setting the
design parameters of the DMS (such as backpressure thresholds and scheduling 
rates) and a performance evaluation of the QoS framework. 

2.3 The Excess-Bandwidth-Detection Algorithm 

Since GB and BE cells in the same input-output pair (i, j) share the same buffer
space in the NGD channel of the pair, a non-trivial mechanism is required to
differentiate the assertion of backpressure for the two NGD traffic components.
Ideally, BE traffic should be admitted to the NGD channel only when the channel
has excess bandwidth available, since the GBS rate allocation is totally devoted
to satisfying the bandwidth requirements of the GB flows. The
Excess-Bandwidth-Detection (EBD) algorithm monitors the availability of excess
bandwidth in the NGD
channel to discriminate the admission of GB and BE traffic when the channel 
is not fully congested (when the channel is fully congested, meaning that its oc- 
cupation exceeds the associated backpressure threshold, backpressure is jointly 
asserted for both GB and BE traffic). 

The EBD algorithm takes advantage of the modular nature of the fabric- 
output scheduler. It counts the EBS services granted to the NGD channel as 
additional excess bandwidth, and the arrival of a BE cell to the channel as 
consumed excess bandwidth. The algorithm associates an NGD credit counter
with each NGD channel. Whenever the NGD channel is empty,
the counter is set equal to the backpressure threshold of the channel. The channel
increases the NGD credit counter at every EBS service that it receives, and
decreases the counter at every arrival of a BE cell. When the counter becomes null,
no more excess bandwidth is available, and the NGD channel selectively asserts
backpressure for its BE component. In order to avoid an infinite accumulation
of credits that could later penalize the GB component of the traffic aggregate,
the counter is never allowed to grow above the threshold.
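
The credit-counter rules just described can be illustrated with a minimal Python sketch of a single NGD channel. The class and method names are hypothetical, and two details are assumptions of the sketch rather than statements from the text: the counter is clamped at zero if a BE cell arrives while no credits remain (for example, a cell already in flight when backpressure was asserted), and the reset to the threshold is applied at the moment the queue drains.

```python
class NGDChannel:
    """Illustrative model of one NGD channel running the EBD algorithm."""

    def __init__(self, threshold):
        self.threshold = threshold  # backpressure threshold of the channel
        self.queued = 0             # cells currently queued in the channel
        self.credits = threshold    # NGD credit counter (channel starts empty)

    def backpressure(self):
        """Return the backpressure state for the GB and BE components."""
        congested = self.queued > self.threshold
        return {"GB": congested, "BE": congested or self.credits == 0}

    def arrive(self, is_be):
        """A cell enters the channel; a BE arrival consumes one credit."""
        self.queued += 1
        if is_be and self.credits > 0:  # assumption: clamp at zero
            self.credits -= 1

    def serve(self, by_ebs):
        """The output scheduler serves one cell (EBS service or GBS service)."""
        if self.queued == 0:
            return
        self.queued -= 1
        if by_ebs:
            # An EBS service adds excess bandwidth, capped at the threshold.
            self.credits = min(self.credits + 1, self.threshold)
        if self.queued == 0:
            # An empty channel restarts with a full set of credits.
            self.credits = self.threshold
```

With a threshold of 2, two consecutive BE arrivals exhaust the credits and trigger backpressure for the BE component only; a single EBS service then restores one credit and releases it.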

As we will show in the presentation of the G-DMS framework, the application 
of the EBD algorithm is not limited to flow-control schemes, but can also be 
extended to cell-dropping policies. 



3 Counting Multicast Cells in the Switching Fabric 

Our reference switch architecture replicates multicast cells exclusively in the 
fabric. The fabric stores a single copy of every incoming cell in its buffer memory. 
Then, it generates one pointer to the memory location containing the cell for 
each output in its fanout, and finally links the pointers to the corresponding QoS
channels. A multicast counter μh is associated with each location h in the buffer
memory. The buffer-memory controller increments the value of μh, initially null,
every time a pointer to memory location h gets queued to a QoS channel, and
decrements the counter every time a copy of the cell gets transmitted to an egress
port card. When μh becomes null again, the buffer-memory controller removes
the cell from the memory location. 

The cell-replication scheme introduces a mismatch between the total num- 
ber of cells stored in the buffer memory of the fabric and the total number of 
pointers that are queued in the QoS channels. In order to charge a single QoS 
channel for the presence of a multicast cell in the buffer memory, we designate a 
primary output for each configured flow. The primary output of a flow is arbitrar- 
ily selected among the outputs in its multicast distribution. All the remaining 
outputs in the fanout are referred to as secondary outputs. Consistently, for each
cell that enters the fabric we have a primary QoS channel and a set of secondary 
QoS channels (this set is obviously empty for unicast flows). The presence of a 
cell in the buffer memory is charged to its primary QoS channel. We refer to a 
flow (cell) having channel Cij as its primary (secondary) channel as a primary 
(secondary) flow (cell) of channel Cij. 

Each QoS channel in the fabric can simultaneously act as the primary channel 
for some of the cells stored in the buffer memory, and as a secondary channel 
for other cells. We associate two distinct counters with each QoS channel Cij. 
The pointer counter βij keeps track of the number of cell pointers that are
queued in Cij. The cell counter γij measures instead the number of cells in
the buffer memory which have Cij as their primary channel (i.e., the number
of primary cells of channel Cij that are currently stored in the buffer memory).
The counter γij is incremented every time a primary cell of channel Cij is stored
in the buffer memory, and decremented when the same cell is removed from the
buffer memory upon transmission of its last replica to any of the egress port
cards.

Under heavy presence of multicast traffic, the two counters typically have 
different values. Considering all the QoS channels in the switching module, the 
sum of the pointer counters is never smaller than the sum of the cell counters, 
because each buffered multicast cell may contribute multiple times to the former, 
and never more than once to the latter. 
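
To make the bookkeeping of this section concrete, the following Python sketch models the multicast counter of each memory location together with the pointer and cell counters of each QoS channel. It is an illustration under simplifying assumptions: a single traffic class (so that a channel is identified by its input-output pair), unbounded memory, and hypothetical names (`FabricBuffer`, `store`, `transmit`) that do not appear in the original design.

```python
from collections import defaultdict, deque


class FabricBuffer:
    """Illustrative shared-memory fabric with per-channel counters."""

    def __init__(self):
        self.primary = {}                 # location h -> primary channel (i, j)
        self.mu = {}                      # multicast counter of location h
        self.beta = defaultdict(int)      # pointer counter per channel (i, j)
        self.gamma = defaultdict(int)     # cell counter per channel (i, j)
        self.queues = defaultdict(deque)  # FIFO of pointers per channel
        self._next_location = 0

    def store(self, inp, fanout, primary_output):
        """Store one cell from input `inp` and link a pointer per output."""
        if primary_output not in fanout:
            raise ValueError("primary output must belong to the fanout")
        h = self._next_location
        self._next_location += 1
        self.primary[h] = (inp, primary_output)
        self.gamma[(inp, primary_output)] += 1  # charge the primary channel
        self.mu[h] = 0
        for j in fanout:
            self.queues[(inp, j)].append(h)
            self.beta[(inp, j)] += 1
            self.mu[h] += 1
        return h

    def transmit(self, inp, out):
        """Send one copy from channel (inp, out); free the cell if it was the last."""
        channel = (inp, out)
        if not self.queues[channel]:
            return None
        h = self.queues[channel].popleft()
        self.beta[channel] -= 1
        self.mu[h] -= 1
        if self.mu[h] == 0:
            # Last replica sent: remove the cell and release its primary channel.
            self.gamma[self.primary.pop(h)] -= 1
            del self.mu[h]
        return h
```

Storing a cell with fanout {1, 2, 3} and primary output 2 raises three pointer counters but only one cell counter, which reproduces the inequality between the two sums stated above.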

The pointer counter is relevant to the activity of the fabric-output scheduler: 
the scheduler can select a QoS channel for service only if the associated pointer 
counter is greater than zero, independently of the corresponding cell counter. 
For the sake of backpressure assertion, on the contrary, both counters may be 
relevant. The backpressure threshold Tij of QoS channel Cij can be compared
to either βij or γij to detect the occurrence of congestion in the channel. The
choice of the relevant counter has major impact on the efficiency of the scheme 
that integrates unicast and multicast traffic, as we argue in the following sections. 



4 Flow Control with Pointer Counters 

In this section, we overview three flow-control schemes that rely on pointer coun- 
ters to identify congested QoS channels, and show how they all fail in supporting 
the QoS requirements of multicast flows. We obtain clear indication that cell 
counters are more promising than pointer counters as a basis for flow control in 
the switch fabric. We will start the investigation of flow-control schemes based
on cell counters in Section 5 below.



4.1 Congestion Detection at a Single Output 

The ingress port card maintains a permanent association between a multicast 
flow and its primary virtual output (i.e., the virtual output of its primary QoS 
channel), while in the fabric the QoS channels assert backpressure based on the 
comparison of the pointer counters with the associated backpressure thresholds. 

This approach makes it possible to control the contribution of a flow to the occurrence
of congestion at its primary output, but has no means to prevent its secondary 
outputs from overflowing. In fact, if a primary output is not congested, the fabric 
never applies backpressure to the corresponding virtual output in the port card, 
so that the flow can keep sending a continuous stream of cells to the fabric, 
disregarding the heavy congestion possibly induced at its secondary outputs. 



4.2 Congestion Detection at Multiple Outputs 

For each multicast flow, the ingress port card applies basic logical operators
(AND, OR) to the portion of backpressure bitmap that overlaps with the fanout
(we assume that a backpressure bit is set to 1 when the pointer counter of the
corresponding channel exceeds the backpressure threshold). The idea is to control
the activity of a multicast flow using some sort of aggregation of the information
that is available on the congestion occurring over its fanout.

Both logical operators fall short of a balanced control of multicast traffic.
The logical OR of the relevant backpressure bits is too conservative, because the
occurrence of congestion at a single secondary output of flow fk is sufficient to
stop its activity. This behavior also induces heavy HOL blocking on the flows that
are queued behind fk in the ingress-port-card scheduler and have no connection
with the congested secondary output of fk. The logical AND, on the other hand,
makes the flow control too loose: the presence of a single uncongested output in 
the fanout allows the multicast flow to keep sending traffic to all other outputs, 
thus overflowing the fabric. 
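
The contrast between the two operators can be seen in a small Python sketch. The function name `flow_blocked` and the dictionary representation of the backpressure bitmap are assumptions made for illustration only.

```python
def flow_blocked(backpressure, fanout, operator):
    """Decide whether a multicast flow is stopped by backpressure.

    backpressure: dict mapping output index -> True when the bit is set
    fanout:       iterable of output indices in the multicast distribution
    operator:     "OR" or "AND", applied over the bits of the fanout
    """
    bits = [backpressure[j] for j in fanout]
    if operator == "OR":
        return any(bits)
    if operator == "AND":
        return all(bits)
    raise ValueError(f"unknown operator: {operator}")


# Output 1 is congested, outputs 0 and 2 are not.
bitmap = {0: False, 1: True, 2: False}

# OR stops the flow as soon as one output of its fanout is congested.
print(flow_blocked(bitmap, [0, 1, 2], "OR"))

# AND lets the flow through while at least one output is uncongested.
print(flow_blocked(bitmap, [0, 1, 2], "AND"))
```

With a single congested output in a fanout of three, the OR rule halts the flow (and everything queued behind it), while the AND rule keeps feeding the congested output, which matches the two failure modes discussed above.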

4.3 Virtual-Output Rotation 

The flow-control mechanism for multicast traffic presented in |S| inspires this 
solution. Instead of maintaining a permanent association with a specific output 
in the fanout, the multicast flow cyclically travels within the ingress port card 
over the virtual outputs of its multicast distribution, switching from one flow 
queue to the following one in the fanout at every service it receives from the 
port-card scheduler. If the flow ever contributes to clogging one of its outputs, it 
ends up being restricted by backpressure when it visits the queue of the congested 
virtual output. 
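
A minimal Python sketch of the rotation rule follows. It models only the movement of one flow across the virtual outputs of its fanout; the class name `RotatingFlow` and its methods are hypothetical, and the port-card scheduler that decides when the flow is offered service is deliberately left out.

```python
class RotatingFlow:
    """Illustrative multicast flow that rotates over its virtual outputs."""

    def __init__(self, fanout):
        if not fanout:
            raise ValueError("fanout must contain at least one output")
        self.fanout = list(fanout)
        self.index = 0  # position of the flow queue currently holding the flow

    @property
    def virtual_output(self):
        """Virtual output whose flow queue currently holds the flow."""
        return self.fanout[self.index]

    def serve(self, backpressure):
        """Try to serve the flow once.

        Returns the virtual output through which the flow was served, or None
        when backpressure is asserted for the current virtual output.
        """
        current = self.virtual_output
        if backpressure.get(current, False):
            return None
        # After each service the flow moves to the next output in the fanout.
        self.index = (self.index + 1) % len(self.fanout)
        return current
```

A flow with fanout [1, 2, 3] is served through output 1, then waits at output 2 for as long as that output asserts backpressure, and resumes its cycle once the backpressure is released.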

The virtual-output rotation avoids buffer overflow, but lacks accuracy in the 
provision of QoS guarantees. A flow with dense multicast distribution and only a 
few congested outputs can unfairly subtract considerable amounts of bandwidth 
from other flows insisting on the same congested outputs, possibly inducing 
heavy violations of their throughput and delay guarantees. 

We illustrate the problem with a simple example and with the support of
Fig. 4. Flows f1 and f2 have the same bandwidth allocation, and fanouts with the
same cardinality but only output j in common. All the outputs in the fanout of
flow f1 are congested, whereas output j is the only congested output in the fanout
of flow f2. The channel scheduler at the fabric output and the virtual-output
scheduler in the ingress port card guarantee the transfer of traffic from virtual
output j to the fabric at a rate that is not smaller than the sum of the bandwidth
allocations of f1 and f2. The presence of congestion at output j, on the other
hand, forces the aggregate rate received by the two flows to be not greater than
the sum of their bandwidth allocations. Under these constraints, the bandwidth
requirements at output j are satisfied for both flows only if they cycle through
their respective fanouts at exactly the same frequency. However, f2 cycles much
faster than f1 because of the immediate services it receives at all outputs different
from j. As a consequence, flow f2 receives more services than flow f1 at virtual
output j. Since f1 has no way to recover the loss of bandwidth at virtual output j,
its throughput and delay guarantees are irremediably compromised.

Fig. 4. Violation of QoS guarantees with virtual-output rotation.

5 Aggregation of Congestion Information over the Multicast Fanout

In the flow-control schemes that we have addressed in Section 4, the assertion of
backpressure from a given QoS channel depends exclusively on local information, 
i.e., on the amount of cell pointers currently queued in the channel. When this 
information becomes available at the ingress port card, it is already too late to 
extract additional elements that could be useful in the regulation of multicast 
traffic. The aggregation of such elements should rather occur in the fabric, upon 
generation of the backpressure bitmap. 

In this section, we define a flow-control scheme that uses cell counters to 
aggregate information on the state of congestion induced by multicast flows on 
their output distributions. The scheme clearly outperforms the algorithms that 
are based on pointer counters, but still fails in supporting robust QoS guarantees, 
as we show with a practical example. The same example also provides directions 
for refining the flow-control scheme based on cell counters. We will adopt the 
appropriate refinements in the next section, where we complete the specification 
of the G-DMS. 

5.1 Flow Control with Cell Counters 

A flow-control scheme that is based on cell counters instead of pointer counters
does aggregate information on the congestion level of multiple outputs. In the
fabric, a cell contributes to congesting its primary channel until its transmission
is completed over all the outputs of its fanout. If some of the secondary outputs
are congested, the cell takes longer to leave the fabric, and thus increases the
probability of backpressure assertion from its primary channel. In the ingress
port card, a consistent use of the backpressure indication derived from the cell
counters of the primary channels requires that each flow be always scheduled
through its primary virtual output.
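The difference between the two kinds of counters can be made concrete with a minimal sketch. It is only an illustration under stated assumptions: the class `FabricCounters`, its method names, and the use of cell identifiers are hypothetical and do not come from the paper. A cell is charged to its primary channel from admission until its last copy is transmitted, while each queued copy is charged to the pointer counter of its own channel.

```python
from collections import defaultdict


class FabricCounters:
    """Toy bookkeeping of pointer and cell counters per QoS channel (i, j)."""

    def __init__(self):
        self.pointer_count = defaultdict(int)  # queued cell pointers per channel
        self.cell_count = defaultdict(int)     # buffered cells per primary channel
        self._pending = {}                     # cell id -> (primary channel, outputs left)

    def admit(self, cell_id, ingress, primary_out, fanout):
        """Store one cell and enqueue one pointer per output in its fanout."""
        primary = (ingress, primary_out)
        self.cell_count[primary] += 1
        self._pending[cell_id] = (primary, set(fanout))
        for out in fanout:
            self.pointer_count[(ingress, out)] += 1

    def transmit(self, cell_id, ingress, out):
        """Send the copy for one output; free the cell after the last copy."""
        primary, left = self._pending[cell_id]
        left.remove(out)
        self.pointer_count[(ingress, out)] -= 1
        if not left:
            self.cell_count[primary] -= 1
            del self._pending[cell_id]


fc = FabricCounters()
fc.admit("a", ingress=0, primary_out=1, fanout=[1, 2, 3])
fc.transmit("a", ingress=0, out=1)  # the primary copy leaves the fabric
fc.transmit("a", ingress=0, out=3)  # the copy for output 2 is still queued
```

At this point the pointer counter of channel (0, 1) is back to zero, but its cell counter still holds the cell because the copy for the congested output 2 has not been sent yet.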

In every scheduler of the switch, the allocation of bandwidth to the traffic 
aggregates that include a multicast flow conforms to the following guidelines: 
(i) for a GD flow in the ingress port card, the guaranteed bandwidth of the 
flow adds to the aggregate service rate of its primary virtual output in the GD 
virtual-output scheduler (the GB class has no virtual-output scheduler, as shown
in Fig. H); (ii) for a GD (GB) flow in the fabric, the guaranteed service rate of
the flow contributes to the GBS service rates of all the GD (NGD) channels in 
its fanout. 
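The two guidelines can be sketched as a small allocation routine. This is a simplified illustration restricted to GD flows; the function `allocate`, the tuple layout of a flow, and the example rates are assumptions made here, not part of the original specification.

```python
from collections import defaultdict


def allocate(flows):
    """Return (ingress_rates, fabric_rates) for a list of GD flows.

    Each flow is a tuple (rate, ingress, primary_output, fanout).
    """
    ingress_rates = defaultdict(float)  # keyed by (ingress, virtual output)
    fabric_rates = defaultdict(float)   # keyed by channel (ingress, output)
    for rate, ingress, primary, fanout in flows:
        # (i) ingress port card: only the primary virtual output is charged
        ingress_rates[(ingress, primary)] += rate
        # (ii) fabric: every channel in the fanout is charged
        for out in fanout:
            fabric_rates[(ingress, out)] += rate
    return ingress_rates, fabric_rates


flows = [
    (0.25, 0, 1, [1, 2, 3]),  # multicast aggregate with primary output 1
    (0.25, 0, 0, [0, 2]),     # multicast aggregate with primary output 0
]
ingress_rates, fabric_rates = allocate(flows)
```

In this example channel (0, 2) accumulates 0.5 cpt in the fabric although no flow is scheduled through virtual output 2 in the ingress port card.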

The flow-control scheme based on cell counters does not expose the fabric to
buffer overflow, but still leaves some open issues in the provision of robust QoS
guarantees to multicast flows. It may happen, for example, that the cell counter
of a channel C_{i,j} exceeds the corresponding backpressure threshold when the
pointer counter of the same channel is null. In this case, traffic destined for
output j can be forced to wait in ingress port card i even if output j is lightly
loaded or not loaded at all. Two distinct conditions can lead to this manifestation
of HOL blocking, which causes loss of aggregate throughput in the whole
switch: (i) the actual load at channel C_{i,j} is below its nominal level, as expressed
by the GBS service rate of the channel; or (ii) the secondary outputs of some
of the primary flows of C_{i,j} are overloaded by flows whose primary outputs are
other than j. In the former case, the individual guarantees of the primary
flows of C_{i,j} can still be enforced, even though the aggregate switch throughput
is no longer maximized. In the latter case, the throughput and delay guarantees
of the primary flows of the channel can be seriously compromised. This second
condition is particularly critical for GD traffic, which typically has stringent
real-time requirements. The following subsection illustrates the critical case with an
example.

5.2 Violation of QoS Guarantees with Cell Counters 

We consider a 4x4 switch with the flow setup summarized in Table 1. All flows in
the switch belong to the GD class and have an allocated service rate that is equal
to 0.01 cpt. The number of flows that are configured for each class determines the
nominal load at the switch inputs and outputs, and the bandwidth allocations at
each stage of the distributed scheduler. A regulated source is a traffic source that
complies with an associated traffic profile (e.g., a token-bucket regulator). An
unregulated source does not comply with its traffic profile, and sends traffic to the
switch at the highest rate allowed by the capacity of the input link, compatibly
with the presence of other flows at the same input. An idle source sends no
traffic to the switch, even if the corresponding flow has a bandwidth allocation
in the ingress-port-card and fabric-output schedulers. A square around an
output number x identifies the primary output of the flow.

Given the traffic setup, and in particular the idling behavior of the sources of
class C4, the nominal and actual loads at the switch inputs and outputs reflect the



Integrated Provision of QoS Guarantees to Unicast and Multicast Traffic 






Table 1. Critical traffic setup (the primary output of each class is shown in
square brackets).

Flow Class   Number of Flows   Source Behavior   Input   Output(s)
C0           74                Unregulated       1       [0]
C1           49                Unregulated       2       [2]
C2           25                Unregulated       0       [1] 2 3
C3           25                Regulated         0       [0] 2
C4           49                Idle              0       [1] 3



distribution reported in Table 2. The idling status of the flows of class C4 makes
all the bandwidth allocated to virtual output 1 at ingress port card 0 available to
the flows of class C2. The allocation of bandwidth for the flows of class C4 is also
available to the flows of class C2 at QoS channels C_{0,1} (the primary channel of
class C2) and C_{0,3} (one of the secondary channels of class C2). As a consequence,
the flows of class C2 have access to the fabric at a higher rate than their nominal
allocation (through virtual output 1 at input 0), and experience no contention
to access outputs 1 and 3 in the fabric. The pointer counters of channels C_{0,1}
and C_{0,3} are therefore null for most of the time. Output 2, on the contrary, is
heavily loaded. In particular, channel C_{0,2} is serviced at exactly its nominal GBS
allocation, but is offered secondary cells from classes C2 and C3 at a rate that
exceeds by far its GBS allocation. Whenever the secondary cells of class C2 have
access to channel C_{0,2} at a rate higher than the nominal allocation for the
corresponding flows, the flows of class C3 experience a loss of throughput. We have
consistently observed this behavior in the simulation of the traffic scenario.



Table 2. Load distribution over the switch inputs and outputs.

                Inputs                      Outputs
                0      1      2      3      0      1      2      3
Nominal Load    0.99   0.74   0.49   0.00   0.99   0.74   0.99   0.74
Actual Load     1.00   1.00   1.00   0.00   1.25   0.75   2.00   0.75
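As a cross-check, the nominal loads of Table 2 can be recomputed from the flow setup of Table 1. The sketch below assumes that every flow contributes its 0.01 cpt allocation to its input and to each output in its fanout; the names `TABLE_1` and `nominal_loads` are introduced here for illustration only.

```python
RATE = 0.01  # allocated service rate per flow, in cells per timeslot (cpt)

# (class, number of flows, input, outputs) as listed in Table 1
TABLE_1 = [
    ("C0", 74, 1, [0]),
    ("C1", 49, 2, [2]),
    ("C2", 25, 0, [1, 2, 3]),
    ("C3", 25, 0, [0, 2]),
    ("C4", 49, 0, [1, 3]),
]


def nominal_loads(table, ports=4, rate=RATE):
    """Sum the allocated rates per switch input and per switch output."""
    flows_in = [0] * ports
    flows_out = [0] * ports
    for _name, count, inp, outputs in table:
        flows_in[inp] += count
        for out in outputs:
            flows_out[out] += count
    to_load = lambda flows: [round(n * rate, 2) for n in flows]
    return to_load(flows_in), to_load(flows_out)


inputs, outputs = nominal_loads(TABLE_1)
print(inputs)   # [0.99, 0.74, 0.49, 0.0]
print(outputs)  # [0.99, 0.74, 0.99, 0.74]
```

The idle class C4 is included on purpose: nominal loads reflect allocations, not the traffic actually offered.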



Out of the total bandwidth available at channel C_{0,2} (0.505 cpt),
only 0.205 cpt go to class C3 (about 80% of the nominal allocation of the class),
and 0.300 cpt are taken by class C2. The evident loss of throughput also implies
the violation of the theoretical delay bounds for the flows of class C3. The
aggregate switch throughput, computed as the ratio between the number of cells
delivered to the egress port cards and the total number of deliverable cells, is only
equal to 0.743, because of the under-utilization of outputs 1 and 3 (0.300 cpt, in
spite of the availability of 0.750 cpt).
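The shares quoted above follow from simple arithmetic on the reported figures. The short check below takes the measured throughputs as given and assumes the nominal allocation of class C3 is 25 flows at 0.01 cpt each; the variable names are illustrative.

```python
ALLOCATED_C3 = 25 * 0.01                # 25 flows at 0.01 cpt each (Table 1)
measured = {"C2": 0.300, "C3": 0.205}   # throughput at the channel, in cpt

total = sum(measured.values())
c3_fraction = measured["C3"] / ALLOCATED_C3
print(f"{total:.3f} cpt in total, C3 at {c3_fraction:.0%} of its allocation")
```

The two classes together use the whole 0.505 cpt available at the channel, with class C3 receiving roughly four fifths of what it was allocated.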

The observation of the simulation results leads to interesting conclusions 
about the possible refinements of the flow-control scheme for the generalization 
of the DMS framework. First, the loss of throughput for class C3 is caused by 






A. Francini and F.M. Chiussi 



the excessive amount of traffic admitted to the fabric for class C2. At the same
time, the aggregate switch throughput suffers for the exact opposite reason,
which is the under-utilization of two of the outputs in the fanout of class C2.
The only way to admit less traffic of class C2 to output 2 and more traffic of the
same class to outputs 1 and 3 is to differentiate the admission to the fabric for
copies of the same cells of class C2. In other words, we should be able to admit
all copies aiming for outputs 1 and 3, and deny access to some of the copies
destined for output 2. In a multicasting scheme that replicates cells exclusively
in the switching module, denying access to a replica is equivalent to dropping
the replica in the fabric.

6 The Generalized Distributed Multilayered Scheduler 

The Generalized Distributed Multilayered Scheduler (G-DMS) extends the DMS
framework to packet switches that simultaneously handle both unicast and
multicast traffic. The G-DMS rejects the overlay approach and unifies the treatment
of unicast and multicast traffic, so that the former can indeed be considered as
a particular case of the latter. The generalization of the DMS framework must
cope with the interaction of multicast traffic with selective backpressure,
subject to the architectural constraint that unicast and multicast flows are scheduled
through the same flow queues in the ingress port cards.

6.1 Dropping Multicast Copies in the Fabric 

The capability of selectively dropping copies of multicast cells in the switch 
fabric is critical to the definition of the G-DMS. In this subsection, we ponder 
the transport-layer implications of this option. 

For a GD flow, whose transport over the packet network typically relies on 
connectionless datagram protocols (such as UDP), it is desirable to deliver pack- 
ets to all the leaves of the multicast tree with maximum speed and minimum 
difference between the delivery times at distinct destinations. When the multi- 
cast distribution tree has branches that are more congested than others, it may 
be acceptable to drop packets on the congested branches, especially if this ac- 
tion is a necessary condition to preserve the desired quality of service on the 
uncongested branches and maximize the aggregate switch throughput. 

For an adaptive flow, which runs end-to-end flow control at the transport
layer (as provided in the unicast case by TCP) and may or may not have
explicit allocation of bandwidth resources at the network nodes (as is the case for
our GB and BE classes, respectively), dropping copies of a multicast packet in
the switch fabric is no different from dropping the same copies downstream in
the network. The same end-to-end mechanism that ensures reliable delivery of
multicast packets to all expected destinations can therefore be invoked to recover
from losses in the switch fabric.

We conclude that dropping part of the fanout of a multicast cell in the switch
fabric conflicts with the requirements of neither real-time nor adaptive
multicast applications, and is compatible with the mechanisms that support
such applications throughout the packet network.

6.2 Flow Control with Cell Counters and Selective Cell Dropping 

The G-DMS relies on backpressure and selective discard of multicast copies to
regulate access to the fabric buffers. Per-flow policers in the port cards contribute
to the fairness of the access regulation.



Per-Flow Policers. Referring again to the traffic scenario specified in Table 1,
it is clear that the source of violation for the QoS guarantees of class C3 is in the
excess of secondary cells of class C2 that are admitted to channel C_{0,2}. Some of
these cells should be denied access to the fabric, but the basic flow-control scheme
with cell counters has no means to explicitly regulate access to the secondary
channels. We therefore need to upgrade the scheme, introducing means to control
the accumulation of secondary cells in the channels. In particular, we would
like to restrict admission to the fabric only for class C2 (whose behavior is not
consistent with the bandwidth allocation), and maintain the channel fully open
to the flows of class C3.

The per-flow policers that are typically available at the ingress port cards
constitute an excellent instrument for detecting any discrepancy between the
actual and the expected behavior of the configured flows. A policer is a device
that monitors the profile of the incoming traffic on a per-flow basis, and compares
it with the traffic profiles specified upon configuration of the flows. If an incoming
packet falls out of profile, the policer marks it, so that it has a higher probability
of being dropped than an unmarked packet when they both arrive at a congested
node in the network. The Generic Cell Rate Algorithm (GCRA) [11] provides a
standard solution for checking conformance of a flow with a token-bucket profile
in ATM networks. Similar policing devices can be easily defined for networks
with variable-sized packets.
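For readers unfamiliar with the GCRA, the following is a minimal sketch of its commonly described virtual-scheduling form, GCRA(I, L), where I is the nominal spacing between cells and L is the tolerance. The class name, the time unit (timeslots), and the parameter values are assumptions chosen for illustration; the paper does not specify how its policers are configured.

```python
class GCRA:
    """Virtual-scheduling form of the Generic Cell Rate Algorithm, GCRA(I, L)."""

    def __init__(self, increment, limit):
        self.increment = increment  # I: nominal inter-cell spacing (timeslots)
        self.limit = limit          # L: tolerated earliness (timeslots)
        self.tat = 0.0              # theoretical arrival time of the next cell

    def conforms(self, arrival_time):
        """Return True if the cell is in profile; update state only if so."""
        if arrival_time < self.tat - self.limit:
            return False            # too early: a policer would mark this cell
        self.tat = max(arrival_time, self.tat) + self.increment
        return True


# A flow allocated 0.01 cpt has I = 100 timeslots; L = 100 tolerates one early cell.
policer = GCRA(increment=100, limit=100)
marks = [not policer.conforms(t) for t in (0, 1, 2, 300)]
```

With these values the third cell, which arrives back to back after two others, is the only one marked as out of profile.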

The G-DMS exploits the action of the policers at the ingress port cards 
to mark incompliant cells and expose them to access denial to the congested 
channels of the switching fabric. 



Static Admission Thresholds for Secondary Channels. We make a first
attempt to complete the definition of the G-DMS with a scheme that associates
a static admission threshold with each secondary channel. The channel drops the
secondary copies of incompliant cells that arrive when its pointer counter exceeds
the admission threshold. After implementing the algorithm, we observe
substantial but not yet satisfactory improvements compared to the scheme with no cell
dropping. Under the usual traffic scenario, the throughput of class C3 increases
from 0.205 cpt to 0.222 cpt (ideally it should be 0.250 cpt), and the aggregate
switch throughput grows from 0.753 to 0.912 (the target is obviously 1.000).

The threshold-based admission policy fails in setting a clear discrimination
between compliant and incompliant cells. Having the pointer counter below the
admission threshold does not necessarily imply that the channel has bandwidth
available for the transmission of incompliant secondary cells. As a consequence,
such cells can still be admitted to the channel in excess of the desirable amount
and thus compromise the QoS guarantees of compliant flows.



Multicast Credits for Secondary Channels. The EBD algorithm that we
have reviewed in Section 2.3 allows the secondary channels to accurately detect
the availability of bandwidth in excess of the amount that strictly satisfies the
requirements of compliant flows. Conforming to the EBD algorithm, we equip
each QoS channel C_{i,j} with a multicast credit counter X_{i,j} for incompliant cells.
When the pointer counter of the channel is null, the credit counter X_{i,j} is set
equal to a multicast credit threshold, which provides a common reference for
all the channels in the switching module. The multicast credit counter X_{i,j} is
decremented every time the channel admits an incompliant secondary cell, and
incremented every time the channel receives an EBS service. If the counter hits
the threshold, it is not allowed to grow any further. If the counter becomes null,
the channel discards all incoming secondary cells that are marked as incompliant.
It should be noted that the admission policy has no effect on incompliant
primary cells, which are taken care of by the backpressure mechanism based on the
cell counters. When one of such cells arrives at the channel, it is always accepted.
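The credit rules above can be sketched as follows. This is a simplified single-channel model, not the authors' implementation: the class `QoSChannel`, its method names, and the flag that distinguishes EBS services from other services are assumptions, and the scheduler that decides when a service is an EBS service is left out.

```python
class QoSChannel:
    """Admission of cell copies at one QoS channel with a multicast credit counter."""

    def __init__(self, credit_threshold):
        self.credit_threshold = credit_threshold
        self.credits = credit_threshold  # multicast credit counter of the channel
        self.pointer_count = 0           # queued cell pointers

    def admit(self, primary, compliant):
        """Return True if the copy is queued, False if it is discarded."""
        if not primary and not compliant:
            if self.credits == 0:
                return False             # no excess bandwidth detected: drop
            self.credits -= 1
        self.pointer_count += 1
        return True

    def serve(self, excess):
        """Transmit one queued copy; `excess` marks an EBS service."""
        if self.pointer_count == 0:
            return
        self.pointer_count -= 1
        if excess:
            self.credits = min(self.credits + 1, self.credit_threshold)
        if self.pointer_count == 0:
            self.credits = self.credit_threshold  # idle channel: reset credits


channel = QoSChannel(credit_threshold=2)
admitted = [channel.admit(primary=False, compliant=False) for _ in range(3)]
```

Starting from a full credit counter, the channel accepts two incompliant secondary copies and discards the third, while primary copies and compliant copies would still be accepted.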

We have integrated the algorithm in the G-DMS, and applied it to the
traffic scenario of Table 1. The throughput observed for class C3 at channel C_{0,2}
matches the expected 0.250 cpt, with consequent satisfaction of the delay bound
expectations for all the flows in the class. The aggregate throughput measured
in the switch is 0.996. The key for the achievement of all the QoS goals is in
the reduction to the expected value of 0.250 cpt for the throughput of class C2 at
output 2, and in the full utilization of outputs 1 and 3 (the measured throughput
is 0.743 cpt at both outputs).



Multicast Credits in Secondary NGD Channels. In order to complete 
the specification of the G-DMS, we must define the way we handle the GB and 
BE classes in the NGD channels. In particular, we must determine whether the 
EBD algorithm that discriminates the assertion of backpressure for primary GB 
and BE flows should use the cell counter or the pointer counter to track the 
credits of the NGD channel. 

The EBD algorithm detects the availability of excess bandwidth in the fabric-
output scheduler, and the activity of the scheduler is controlled by the pointer
counters of the channels (for the scheduler, an idle QoS channel is a channel
whose pointer counter has null value). For this reason, we decide to increment
the credit counter of the NGD channel every time the channel receives an EBS
service, and decrement the counter every time the channel receives a BE cell
pointer. Notice that we still base the detection of congestion in the NGD channel,
which triggers the joint assertion of backpressure for GB and BE flows, on the
value of the cell counter.

By running extensive simulations with GB and BE flows in the critical traffic
scenario of Table 1, we have observed the systematic achievement of all QoS
expectations.

6.3 Cell Counters versus Pointer Counters in the G-DMS 

With the introduction of the mechanism that selectively drops secondary cells 
in QoS channels with lack of excess bandwidth, the use of cell counters instead 
of pointer counters in the flow-control scheme for primary flows may appear to 
be harder to motivate. We argue that there are at least three clear reasons to 
maintain the role of the cell counters in the flow-control scheme for primary 
flows. 

First, the amount of cells that can be globally admitted to the switching 
module is much higher when admission is regulated by the cell counters, because 
each cell contributes to the cell count only at the primary channel, with no 
contribution to the cell count at the secondary channels. 

Second, for real-time traffic, the preservation of tight delay guarantees
requires that the transfer of cells from the virtual outputs in the ingress port
cards to the QoS channels in the fabric be as smooth as possible. In order to
maintain the distributed scheduler in close approximation of its ideal hierarchical
reference, the virtual outputs should never experience long periods of time
characterized by continuous assertion of backpressure from the corresponding QoS
channels. Even if all flows configured in the system are compliant with their
traffic profiles, the pointer counters can still undergo rapid fluctuations determined
by the arrival of secondary cells whose flows are controlled by different primary
channels. The cell counters, on the contrary, filter out these fluctuations, making
the virtual-output perception of the state of the QoS channels much more stable
than when pointer counters are used.

Third, in the presence of selective discard of multicast cell copies, the en- 
forcement of robust QoS guarantees relies on the accuracy of the mechanism 
that allocates bandwidth resources and sets the policing parameters for the con- 
figured flows. With pointer counters, any over-allocation of resources may induce 
buffer overflow in the fabric. With cell counters, on the contrary, the overall oc- 
cupation of the buffer memory is always under control, with no risk of buffer 
overflow. The simulation results presented below substantiate our arguments. 

Cell Counters versus Pointer Counters: Simulation Results. The
simulation in a 3x3 switch of the traffic scenario summarized in Table 3 illustrates
the benefits of using cell counters instead of pointer counters in the assertion of
backpressure for primary flows in the G-DMS (all flows in the table belong to
the GD class). Given the traffic setup, and in particular the unregulated nature
of the sources of classes C1 and C2, the nominal and actual loads at the switch
inputs and outputs reflect the distribution reported in Table 4. Output 1 of the
switch is overloaded by the presence of unregulated traffic coming from input 1
(class C2). Similarly, the capacity of input 0 is saturated by the presence of
unregulated sources sending traffic to output 2 (class C1). The traffic conditions at
input 0 and output 1 critically restrict access to QoS channel C_{0,1}, which
aggregates unicast cells of class C0 and secondary multicast cells of class C3. Flows of
class C3 are strictly regulated (the leaky-bucket size is 2 cells), but their number
is large, and they experience no contention at their primary output (output 0 is
lightly loaded). As a consequence, the queue length of channel C_{0,1} can undergo
wide fluctuations determined by the bursty arrival of compliant secondary cells
of class C3. The unicast flows of class C0 are heavily penalized by these
fluctuations when the assertion of backpressure depends on the pointer counters. The
use of cell counters, on the contrary, makes the admission of secondary cells to
channel C_{0,1} much smoother.

Table 3. Traffic setup showing the benefits of cell counters versus pointer counters
(the primary output of the multicast class is shown in square brackets).

Flow    Allocated    Number     Source        Bucket Size   Input   Output(s)
Class   Rate (cpt)   of Flows   Behavior      (cells)
C0      0.01         25         Regulated     2             0       1
C1      0.01         50         Unregulated   100           0       2
C2      0.01         50         Unregulated   100           1       1
C3      0.0001       2400       Regulated     2             0       [0] 1

In the simulation experiment, we observe that the overall throughput for
class C0 is 99.99% of the offered load when backpressure depends on cell counters,
and 93.71% of the offered load when backpressure depends on pointer counters.
The worst-case transmission delay observed for class C0 is equal to 198 timeslots
with cell counters, and is instead unstable when the pointer counters drive the
assertion of backpressure (as an example, at the end of a 10,000,000-timeslot
simulation run we have recorded a worst-case delay of 650,736 timeslots).



Table 4. Load distribution over the switch inputs and outputs.

                Inputs               Outputs
                0      1      2      0      1      2
Nominal Load    0.99   0.50   0.00   0.24   0.99   0.50
Actual Load     1.00   1.00   0.00   0.24   1.49   0.51



6.4 G-DMS Overhead 

The G-DMS upgrades the flow-control functionality of the basic DMS while
keeping its scheduling features unchanged. Table 5 lists the per-channel
counters involved in the implementation of the G-DMS. The only overhead induced
by the generalization of the basic DMS is given by the introduction of the
cell counter and the multicast credit counter X_{i,j} at each QoS channel. No
changes affect the number of QoS channels, the size of the backpressure bitmap,
and the fabric-output and ingress-port-card schedulers.

Table 5. Per-channel counters in the G-DMS.



Channel 


Pointer Counter 


Cell Counter 


NGD 


Multicast 








Credit Counter 


Credit Counter 


GD 


P?f 




N/A 




NGD 


oNGD 




a^GD 


yNGD 



7 Concluding Remarks 

We presented the Generalized Distributed Multilayered Scheduler (G-DMS), a 
framework for the integrated provision of differentiated QoS guarantees to uni- 
cast and multicast flows in multi-stage packet switches. Two distinguishing fea- 
tures characterize the G-DMS: (i) a flow-control scheme that regulates access to 
the switching fabric based on the actual occupation of the buffer memory and 
not on the number of queued cell pointers, and (ii) the capability of selectively 
dropping copies of multicast cells that violate the traffic profiles of the corre- 
sponding flows. The algorithm that triggers the selective dropping of cell copies 
in the fabric exploits the modular nature of the schedulers at the fabric outputs 
and the presence of traffic policers in the ingress port cards. 

The G-DMS meets the QoS expectations of both unicast and multicast flows
while adding only minimal overhead to the implementation complexity of the
underlying DMS (two counters per QoS channel).



Abstract. The current best-effort infrastructure in the Internet lacks
key characteristics in terms of delay, jitter, and loss, which are required 
for multimedia applications (voice, video, and data). Recently, significant 
progress has been made toward specifying the service differentiation to 
be provided in the Internet for supporting multimedia applications. In 
this paper, we identify the main traffic types, discuss their characteristics 
and requirements, and give recommendations on the treatment of the dif- 
ferent types in network queues. Simulation and measurement results are 
used to assess the benefits of service differentiation on the performance 
of applications. 



1 Introduction 

The Internet is seeing the gradual deployment of new multimedia applications, 
such as voice over IP, video conferencing, and video-on-demand. These appli- 
cations generate traffic with characteristics that differ significantly from traffic 
generated by data applications, and they have more stringent delay and loss 
requirements. Voice quality, for example, is highly sensitive to loss, jitter, and 
queueing delay in network node (i.e., switch or router) buffers, and the quality of 
video traffic is significantly degraded during network congestion episodes. The 
current best-effort infrastructure of the Internet is ill-suited to the quality of 
service requirements of these applications. 

In addition to emerging streaming applications, the Internet must also sup- 
port interactive data applications. Good user-perceived performance of these 
applications, such as telnet, video gaming, and web browsing, requires short 
response times and predictability. However, these requirements are often not 
met, due to the interaction of TCP with packet loss during network congestion 
periods. 

In this paper, we identify the main traffic types, discuss their characteristics 
and requirements (Sec. 2), and examine the degree of separation of traffic types 
necessary to provide adequate user-perceived performance (Sec. 3). In addition, 
we describe a conceptual model for a network node port that provides service 
differentiation and discuss the different required functionalities (Sec. 4). Within 
this model, we demonstrate that the provision of multiple packet drop priorities
within a queue, in association with appropriate packet marking, can further
enhance performance (Sec. 5). Throughout, simulation and measurement results 
support our proposals. 

2 Multimedia Traffic Characteristics and Requirements 

In this section we present the characteristics and requirements of voice, video, 
and interactive data applications. For each application, we discuss its character- 
istics in terms of data rate and variability, and we describe its requirements in 
terms of delay, jitter, and packet loss. These have to be well understood in order 
to determine the appropriate treatment to give each application in the network. 



2.1 Voice 

Voice connections generate a stream of small packets of similar size (a few tens
of bytes) at relatively low bit rates. Typical stream rates range from 5 Kbps to
64 Kbps, depending on the encoding scheme, to which header overhead adds a
few tens of Kbps. Therefore, voice stream rates remain on the order of tens of
Kbps, regardless of the encoding scheme. For example, G.711, a simple pulse
code modulation encoder, generates evenly spaced 8-bit samples of the voice
signal at 125 μsec intervals, resulting in a 64 Kbps stream. It is possible to
reduce the rate through voice compression schemes and silence suppression, at
the expense of increased variability. The suppression of samples corresponding
to silence periods (which account for 45% of total time in typical conversations)
leads to substantial average rate reduction. For example, G.729A generates
an 8 Kbps stream, while G.723 generates a 5.33 Kbps stream.
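The rates quoted above can be reproduced with a short calculation. The sketch below relies on assumptions that are not stated in the text: voice frames are packetized every 20 msec and carried over RTP/UDP/IPv4 with no header options or compression (40 bytes of headers per packet). The function name `stream_rate_kbps` is illustrative.

```python
HEADER_BYTES = 20 + 8 + 12  # IPv4 + UDP + RTP headers, without options


def stream_rate_kbps(codec_kbps, packet_interval_ms, header_bytes=HEADER_BYTES):
    """Return (payload bytes per packet, total stream rate in kbps)."""
    payload_bytes = codec_kbps * packet_interval_ms / 8  # kbps * ms = bits
    packets_per_second = 1000 / packet_interval_ms
    header_kbps = header_bytes * 8 * packets_per_second / 1000
    return payload_bytes, codec_kbps + header_kbps


# G.711 produces 8000 8-bit samples per second
g711_kbps = 8 * 8000 / 1000
print(g711_kbps)                        # 64.0
print(stream_rate_kbps(g711_kbps, 20))  # (160.0, 80.0)
print(stream_rate_kbps(8, 20))          # (20.0, 24.0)
```

Under these assumptions the headers add 16 Kbps to every stream, so an 8 Kbps codec occupies about 24 Kbps on the wire, which is consistent with stream rates staying on the order of tens of Kbps for all encoders.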

For the Internet to provide toll quality voice service, packet delay and loss
must meet stringent requirements. Interactivity imposes a maximum round trip
time of 200-300 msec. That is, the one-way delay incurred in voice encoding,
packetization, network transit time, de-jittering, and decoding must be kept
below 100-150 msec. Jitter must be limited (e.g., less than 50 msec) to ensure
smooth playback at the receiver. Subjective tests have shown that periods of
lost speech (clips) larger than 60 msec affect the intelligibility of the received
speech. Since packet loss in the Internet is bursty, the probability that
consecutive voice packets are lost, resulting in long clips, is significant. Therefore,
loss rates have to be kept at very low levels unless packet loss concealment is
used.



2.2 Video 

Video traffic is stream-oriented and spans a wide range of data rates, from tens
of Kbps to tens of Mbps. The characteristics of encoded video (data rates and
variability in time) vary tremendously according to the content, the video
compression scheme, and the video encoding scheme.

In terms of content, more complex scenes and more frequent scene changes
require more data to maintain a certain level of quality. For example, video 
streams of talking heads are lower-rate and less variable than those of motion 
pictures and commercials. 

Different video compression schemes, such as H.261, H.263, MPEG-1, and 
MPEG-2, are designed to meet different objectives and therefore have different 
bit rates and stream characteristics. For instance, the applications of H.261 in- 
clude video conferencing; consequently, the rates are multiples of 64 Kbps, up 
to 2 Mbps, and the coding is designed to achieve a fairly uniform bit rate across 
frames. On the other hand, prerecorded movies using MPEG-2 may have several 
times the picture resolution and typically require several Mbps. 

The video characteristics are also affected by the video encoding control
scheme used. For a given content and a given compression scheme, constant bit
rate (CBR) video maintains a streaming rate that varies little over time. By
contrast, variable bit rate (VBR) video traffic has been shown to be self-similar
and may have a peak rate which is many times the average. Typically, VBR
aims to achieve a more consistent quality for the same average bandwidth, and
it is more commonly employed in practice.

The latency requirements of video depend on the application. Like voice,
interactive video communication requires low delay (200-300 msec round-trip);
however, one-way broadcast and video-on-demand may tolerate several seconds
of delay. As is the case for voice, a packet delayed beyond the time when it
needs to be decoded and displayed is considered lost. Furthermore, it has been
observed that packet loss rates as low as 3% can affect up to 30% of the frames,
due to dependencies in the encoded video bit stream. Therefore, the packet
loss rate and delay in the network should be kept small.

2.3 Interactive Data Applications 

Data applications still account for the large majority of Internet traffic and
have significantly different characteristics and requirements. We focus here on
interactive data applications, namely telnet (remote login) and the web.

Typical telnet sessions consist of characters being typed by a user at a ter- 
minal, transmitted over the network to another machine (server), which echoes 
them back to the user’s terminal. The packet stream being generated consists 
of small datagrams (typically less than 50 bytes). Occasionally, the results of 
commands typed by the user are sent back by the server. This results in
asymmetric traffic, with server to user terminal traffic on average 20 times
the user to server traffic. Packet interarrival times have been found to follow
a heavy-tailed (Pareto) distribution, resulting in somewhat bursty traffic [23].
However, the inter-packet time is normally limited by the typing speed of
humans, which is rarely faster than 5 characters per second, giving a minimum
average inter-packet time of 200 msec. Consequently, the traffic generated by
telnet is of relatively low bandwidth and low burstiness.

Telnet is a highly interactive application and, similarly to voice over IP, has 
strict delay requirements on individual packets. Echo delays start to be
noticeable when they exceed 100 msec, and in general, a delay of 200 msec is the
limit beyond which the user-perceived quality of the interactivity suffers.
Furthermore, telnet traffic is highly sensitive to packet loss, since the
retransmit timeout
required for recovery has a minimum value that exceeds the maximum accept- 
able echo delay. Therefore, telnet packet loss needs to be kept at a minimum for 
best user-perceived performance. 

Web traffic, carried by HTTP over TCP, is closely tied to the contents of 
web pages and to the dynamics of TCP. Trace studies of web traffic have shown 
that the majority of HTTP requests for web pages are smaller than 500 bytes. 
HTTP responses are typically smaller than 50 KB, but may also be very large 
when HTTP is used to download large files off web pages. Indeed, HTTP
responses have been found to follow a heavy-tailed distribution, corresponding
to that of web files in the Internet. Moreover, the aggregate traffic generated by
many users of the WWW has been shown to exhibit self-similarity.

In general, short page download times (less than 5 seconds) are required for 
good user-perceived performance. In addition, users highly value the predictabil- 
ity of web response times. In other words, not only does the average download 
time need to be small, but so does the variance of download times. In this con- 
text, it is important to distinguish between the interactive use of HTTP, i.e. for 
downloading actual web pages (html file and images) which tend to be short 
transfers (and therefore have low rate), and the non-interactive FTP-like use, 
where HTTP is employed to download large files. 

3 Mixing vs. Separating Traffic Types 

As discussed above, voice, video, and data applications differ significantly in 
their traffic characteristics and requirements. Naturally, we are interested in 
understanding how we can support the various traffic types in a single network, 
such that the user-perceived performance is maximized. It appears reasonable 
that identifying different classes of traffic in order to separate them at the queues 
in the network and to treat them appropriately would achieve this goal. The 
questions that arise are: Where in the network would differential handling be 
necessary, if at all? Which types need to be separated, and which types can be 
safely mixed? What is the appropriate treatment of each class in its queue? In 
this section we attempt to answer the first two questions, and we address the 
third in Secs. 4 and 5.



3.1 Mixing Voice and Data 

We investigate the effect of mixing and separating traffic types, i.e., which
types can be mixed together in the same queue without incurring a significant
loss in throughput, and which types need to be separated to meet performance
objectives. We first consider the impact of data traffic on voice traffic by mixing
1 Mbps of voice traffic (11 streams) with TCP data traffic. We determine the
maximum number of data sources that can be mixed with voice traffic if voice
packets are to satisfy a 10 ms delay budget allocated to the section of the path
being studied. We consider paths composed of either T1, 10Base-T, or
100Base-T links. Simulations reveal that mixing voice and data traffic is
impossible for T1 links (1.5 Mbps). In the case of 10Base-T links, it is only
possible for less bursty data flows, at the expense of a significant throughput
reduction. Mixing FTP traffic with voice is possible only on 100Base-T links, in
which case the link utilization must be kept below 20%. This result shows that
data traffic is incompatible with voice traffic and should be separated from it.

To corroborate these simulation results, we examine the transmission
of voice on a VPN path between two Internet POPs in the U.S., coast-to-coast
during business hours. Using delay measurements from this path, we simulate the
quality of G.711 VoIP calls. In Fig. 1, we show the voice quality experienced,
quantified by the MOS (Mean Opinion Score) scale. There are several series, 
each corresponding to different echo loss (echo cancellation) capabilities. EL=inf 
corresponds to perfect echo cancellation, while EL=31 corresponds to poor echo 
cancellation. Given that toll quality voice has a MOS of 4-5, we observe that 
even for good echo loss (EL=41, 51), there are many times in the course of the 
hour when quality is poor, due either to packet loss or excessive delay. We also 
note the importance of echo cancellation capabilities in contributing to voice 
quality. 




Fig. 1. MOS, averaged over 5-second intervals, for different echo cancellation
capabilities.



We now simulate 1000 G.711 calls of exponentially distributed duration in the
same environment. The playout deadline is optimized every 15 sec, and the echo
cancellation is very good (EL=51). Even with this favorable setup over the path,
we observe in Fig. 2 that the quality experienced is much worse than toll
quality: around 10% of the calls experience at least one minute of poor quality
(the MOS drops below 4), and 4% of all calls experience poor quality in at least
10% of their minutes. Note that these results account only for voice transit
between two POPs; end-to-end, the quality suffers yet further. Clearly, voice and
data traffic should be separated into different traffic classes.




Fig. 2. Percentage of calls with MOS less than a particular value. 



3.2 Mixing Voice and Video 

Now that we have established the need for at least two queues, we consider 
whether voice and video can be mixed. First, we note that CBR and VBR video 
have very different characteristics, so we consider them separately. 

When CBR video is mixed with voice, the increase in voice delay is
contained because the CBR video streams are well behaved. Our work shows
that since voice and CBR video have similar characteristics and real-time delay
requirements, we may allow them to be handled together. Consider a scenario
in which CBR video streams are added to a 7-hop 10Base-T network carrying
a 450 Kbps aggregate voice load. Voice delay exceeds 10 ms only when the link
approaches full utilization. However, on a T1 link (1.5 Mbps), to achieve the
target performance, only one video stream and two voice streams can be admitted,
resulting in utilization as low as 55%. Therefore, unless the rate of the video
stream is large compared to the link bandwidth, CBR video does not hurt voice
traffic.

Consider the same scenario, with CBR video replaced by VBR video. As
expected, given the characteristics of VBR video, the results of its mixing with
voice traffic differ greatly from those of CBR video. Figure 3 shows the
distribution of delay of voice and video packets when they are mixed and when
they are separated. As we increase the number of multiplexed VBR video streams,
both voice and video delays increase rapidly because of the long bursts that
get injected into the queue. This implies that if latency constraints are to be
met, only a small number of video streams may be mixed with voice, resulting
in low network utilization. In general, the achievable throughput depends on
the voice/video mix, the burstiness of the video streams, and the link speeds;
however, it is clear that VBR video cannot be mixed successfully with voice.



(Plot label: Loss Rate. Setup: 10Base-T, 7 hops, VBR video stream with 730 Kb/s
average and 2.3 Mb/s peak, 450 Kb/s aggregate voice load.)




Fig. 3. Delay distributions for voice and VBR video. Video streams are added to the 
network. 



3.3 Mixing Video and Data 

We have shown that VBR video and voice are incompatible. The next question 
we attempt to answer is: can data and VBR video be mixed? It turns out that the 
answer depends on what delay can be tolerated by the video and what aggregate 
throughput is considered acceptable. In general, when the delay bounds are 
tighter, the interfering traffic load must be kept lighter. In addition, there is 
another important factor, which is the buffer size. Indeed, there is a tradeoff 
between a large buffer size, with a correspondingly large queueing delay, and 
a small buffer size, with the increased possibility of packet loss due to buffer 
overflow and the resulting decrease in throughput. 
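
The buffer-size side of this tradeoff can be made concrete with a small calculation. The following Python sketch is illustrative only: it assumes a single FIFO queue drained at the full link rate, and the function names are hypothetical rather than taken from the simulations reported here.

```python
def max_queueing_delay_ms(buffer_bytes: int, link_bps: float) -> float:
    """Time (ms) to drain a completely full FIFO buffer at the link rate."""
    return buffer_bytes * 8 / link_bps * 1000.0


def buffer_for_delay_bytes(delay_ms: float, link_bps: float) -> int:
    """Largest buffer (bytes) whose drain time does not exceed delay_ms."""
    return int(link_bps * delay_ms / 1000.0 / 8)


# Nominal link rates as quoted in the text (T1 taken as 1.5 Mbps).
LINK_RATES_BPS = {"T1": 1.5e6, "10Base-T": 10e6, "100Base-T": 100e6}

if __name__ == "__main__":
    for name, rate in LINK_RATES_BPS.items():
        for bound_ms in (100, 500):
            size = buffer_for_delay_bytes(bound_ms, rate)
            print(f"{name}: {bound_ms} msec bound -> at most {size} bytes of buffer")
```

Under these assumptions, a 100 msec bound on a T1 link leaves room for only 18750 bytes of buffering, against 125000 bytes on a 10Base-T link, which is one way to see why the slower links are more sensitive to buffer sizing.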

In Table 1, we show the results of a scenario in which a constant video load
and FTP streams are mixed, and a video packet loss rate of 10“^ is tolerated. 
Considering the large dependence of the results on the buffer size, we experiment 
with a range of buffer sizes which scale according to the link speeds considered. 
We determine, for network delay requirements of 100 msec and 500 msec for 
the video stream, the total achievable aggregate throughput. We find that the 
buffer size which maximizes the throughput increases proportionally to the delay
bound, and the total achievable throughput can be increased if the delay bound
is relaxed. Lower speed links are also more sensitive to proper buffer sizing, so
on such links it is not advisable to mix video with data even when interactivity
is not required. To summarize, video should be mixed with data only on high
bandwidth links.



Table 1. Video mixed with FTP traffic: maximum achievable throughput for 100 msec 
and 500 msec end-to-end delay bounds for video (tolerable loss rate: 10“^). H(Qmax) 
is the maximum buffer delay, Qmax is the buffer size.

3.4 The Case for Three Classes of Service

From the results above, we conclude that separating multimedia traffic into a 
minimum of three queues — voice, video, and data — is necessary for good per- 
formance for a range of network conditions. Interactive voice, with its high ex- 
pectations, requires a queue of its own, for protection from the bursty traffic of 
VBR video and TCP data applications. Video may be mixed with data under 
particular conditions, but other considerations also compel us to give it its own 
queue. The real-time constraints of video call for higher priority in scheduling 
compared to data. Furthermore, the reliability levels required by video are close 
to 100%, whereas TCP is designed to recover from loss. Finally, in contrast to 
data traffic, video streams should be subjected to admission control due to their 
large and predictable rates. The nature of CBR video allows it to be combined 
with voice or with VBR video. 

Yet we must be sensitive to the two extremes. High speed links tolerate bet- 
ter the mixing of disparate traffic types, such that on very high speed backbone 
routes, differential handling may be unnecessary. Likewise, very low speed links 
may require finer grain distinction of packets and/or packet preemption mecha- 
nisms. 

4 Network Node Structure 

From the discussion in the previous section, we can consider a model for the 
internal structure of a network node (output) port, which is shown in Fig. 4.
This structure is to be used at places where packet buffering is performed within
the switching node, which could be at the input ports, the output ports, or both.
We assume here that the node has an output queued implementation to avoid
having to go into the details of the different possible switching architectures. 
We describe the components of the port (classifier, traffic conditioner, buffer 
management, and scheduler) in more detail below. 








Fig. 4. The structure of a network node port containing three queues. 



4.1 Packet Classification 

The first step in providing differentiated services is enabling network nodes to 
identify the class of service of each packet they receive, possibly through a special 
marking carried by the packet. For example, the DiffServ architecture uses
the byte in the IP header previously allocated to TOS and renamed DS Field, 
as a priority code. In the IEEE 802.1 LAN realm, packet identification is done 
through a field in the recently adopted VLAN tag (added to the MAC frame 
header) that indicates the class of service of each packet. The priority code is 
checked by the classifier upon reception of a packet to determine the queue in 
which the packet is to be placed. Packets carrying the same marking expect to 
receive the same treatment in the network. The discussion in the previous section 
suggests that 3 queues are necessary: one for voice, another for video, and the 
third for data. However, allowing for more queues (e.g., 8 as in IEEE 802.1D)
increases flexibility in assigning traffic to appropriate classes. 

Before a packet is enqueued, it goes through traffic conditioners, which per- 
form functions such as metering and policing. The policer ensures that the traffic 
entering a queue does not exceed a certain limit, determined by the queue’s allo- 
cation of the link resources. This functionality is particularly needed for queues 
that are serviced with high priority in order to avoid starvation of lower priority 
traffic. 
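
As an illustration of the policing function described above, the following Python sketch implements a simple token bucket policer. It is a minimal sketch under stated assumptions, not the conditioner used in our simulations: the class name, the parameters `rate_bps` and `burst_bytes`, and the policy of rejecting non-conforming packets outright (rather than re-marking them) are all choices made for the example.

```python
class TokenBucketPolicer:
    """Admits packets as long as the flow stays within rate_bps and burst_bytes."""

    def __init__(self, rate_bps: float, burst_bytes: int):
        self.rate_bytes_per_s = rate_bps / 8.0   # token fill rate
        self.burst_bytes = burst_bytes           # bucket depth
        self.tokens = float(burst_bytes)         # bucket starts full
        self.last_time = 0.0                     # time of the previous packet (s)

    def conforms(self, now: float, size_bytes: int) -> bool:
        """Return True if the packet arriving at time `now` may be enqueued."""
        elapsed = now - self.last_time
        self.last_time = now
        # Refill for the elapsed time, never beyond the bucket depth.
        self.tokens = min(self.burst_bytes,
                          self.tokens + elapsed * self.rate_bytes_per_s)
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        return False  # out of profile: the caller drops (or re-marks) the packet


if __name__ == "__main__":
    policer = TokenBucketPolicer(rate_bps=80_000, burst_bytes=1000)
    for t, size in [(0.0, 600), (0.0, 600), (0.05, 600)]:
        print(t, size, policer.conforms(t, size))
```

With a policer of this kind in front of a high priority queue, the load offered to that queue is bounded by the configured rate and burst, so lower priority queues keep a predictable share of the link.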



4.2 Scheduling 

With traffic separated in multiple queues, a scheduler is required to service 
them. The scheduler’s service discipline needs to be carefully designed in order 
to provide the appropriate delay through the node for each traffic type. In this
section, we discuss the appropriate service discipline for each queue and the
supporting mechanisms needed, such as admission control. 



Voice. In Sec. 3, we argued for separating voice traffic in a queue of its own.
The next step is to determine the appropriate scheduling discipline for
providing voice with the required quality of service. We show that a strict
high priority service is appropriate for handling voice traffic. Through modelling
and simulation, we find that, considering the conservative 99.999th percentile of
packet delays, priority queueing does limit the delay and jitter of voice packets
for typical link speeds. Refer to Fig. 5, where we plot the complementary
cumulative distribution function (ccdf) of voice packet delays for a voice load of
1.1 Mbps over five 45 Mbps hops. We compare different link scheduling schemes,
namely, priority queueing with preemption of low priority transmissions, priority
queueing (PQ), weighted round robin (WRR) with a weight of 1.5 Mbps, and
WRR with a weight of 10 Mbps for the voice queue. We also plot the ccdf for
the case where voice packets are given a separate 10 Mbps circuit. The graphs
show that, as would be expected, priority queueing with preemption achieves
negligible queueing delays over the 5 hops. In addition, non-preemptive priority
queueing still results in low queueing delays (the 99.999th percentile is smaller
than 2 msec, ignoring switching time through the node). WRR scheduling
requires a large weight for the voice traffic (10 Mbps, more than 9 times the
actual load) for it to compare to PQ. Note that the round robin scheduler ensures
that a large weight for the voice queue does not translate into wasted resources,
since low priority traffic can utilize any unused bandwidth. In contrast, providing
a 10 Mbps dedicated circuit for voice results in delays that are better than those
of non-preemptive priority queueing and WRR, at the cost of wasted resources.
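
The order of magnitude of the non-preemptive result can be checked with a rough calculation. The Python sketch below assumes that a voice packet waits, at each hop, for at most one low priority packet already in transmission, and that such a packet is at most 1500 bytes long; both the packet size and the function name are assumptions made for this example, and queueing behind other voice packets is ignored.

```python
def blocking_delay_ms(max_packet_bytes: int, link_bps: float, hops: int) -> float:
    """Worst-case wait (ms) behind one in-flight low priority packet per hop."""
    per_hop_s = max_packet_bytes * 8 / link_bps
    return hops * per_hop_s * 1000.0


if __name__ == "__main__":
    # Five 45 Mbps hops, as in the scenario plotted for voice.
    print(f"45 Mbps, 5 hops: {blocking_delay_ms(1500, 45e6, 5):.2f} ms")
    # A single 1.5 Mbps (T1) hop, for comparison.
    print(f"1.5 Mbps, 1 hop: {blocking_delay_ms(1500, 1.5e6, 1):.2f} ms")
```

Under these assumptions the bound is about 1.33 ms over the five 45 Mbps hops, in line with the simulated percentile quoted above, whereas a single T1 hop alone contributes 8 ms, which suggests why very low speed links may call for packet preemption.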



(Plot title: CCDF of Queuing Delay. Setup: T3, G.729A, 30 ms, 5 hops, 54 voice
streams, voice load = 1.1 Mb/s.)




Fig. 5. Voice delay distributions for different link scheduling schemes.

While it may appear that both priority queueing and WRR can be used,
the results in this graph consider only one low priority queue. Therefore, one 
would expect worse results if the round robin scheduler services more queues, 
where a voice packet may have to wait for more than one low priority packet 
transmission. While WRR has the well known advantage of preventing starvation 
of low priority queues, we believe that in the context of the Internet, no special 
precaution has to be taken to prevent voice traffic from starving others. Indeed, 
voice traffic volume is limited, and its growth rate is significantly smaller than 
that of data and other applications. Therefore, its share of the total traffic is 
decreasing. This means that not only would it not starve other traffic, but also 
that voice traffic may not require per-flow admission control. Rather, appropriate
provisioning of the network would allow it to be serviced at highest priority. 



Video. As discussed in Sec. 3, CBR video traffic is well behaved and can 
be mixed with voice traffic. If CBR video is mixed with voice traffic in the 
same queue, admission control would be needed, given the higher rates of video 
streams. On the other hand, VBR video streams need to be mapped to a queue 
separate from voice. This queue has to be serviced using round robin schedul- 
ing; otherwise, it may starve lower priority classes due to the burstiness of its 
traffic. Similarly to CBR video, VBR video streams may need to be subjected 
to admission control. 



Data. Data applications could be all mapped to the same queue or, given the 
wide range of characteristics and requirements among them, may benefit from 
being mapped to multiple queues. Thus, if more than one data queue is available, 
low rate interactive data applications can be shielded from other data applica- 
tions by separating them. In particular, telnet would benefit from having its own 
queue, serviced with high weight, especially on low speed links. 

If the buffer sizes are chosen small enough to restrict the queueing delay of 
interactive packets to acceptable levels, all data applications may share the same 
queue. Then, differentiation among the different applications can be provided by 
assigning the packets to different drop priority levels within that queue. 

Given that most data flows are short-lived, it is impractical
to perform per-flow admission control. In addition, TCP generates bursty traffic,
and therefore it is not possible to guarantee congestion-free service for data traffic 
without significant over-provisioning. Thus, the data queue should be serviced 
with a round robin scheduler to avoid starving lower-priority queues, if any. 
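
The service disciplines discussed in this section can be combined in one port scheduler. The Python sketch below is a simplified illustration and not the simulator used for our results: the voice queue is served with strict non-preemptive priority, and the remaining queues share the link through a packet-count weighted round robin. The class name, the per-round packet weights, and the use of packet counts instead of byte counts are assumptions made to keep the example short.

```python
from collections import deque


class PortScheduler:
    """Strict priority for voice; weighted round robin for the other queues."""

    def __init__(self, weights):
        # weights: queue name -> packets served per round (hypothetical values)
        self.voice = deque()
        self.weights = dict(weights)
        self.order = list(weights)
        self.queues = {name: deque() for name in self.order}
        self.index = 0
        self.credit = self.weights[self.order[0]]

    def enqueue(self, traffic_class, packet):
        if traffic_class == "voice":
            self.voice.append(packet)
        else:
            self.queues[traffic_class].append(packet)

    def _advance(self):
        # Move to the next round robin queue and refresh its credit.
        self.index = (self.index + 1) % len(self.order)
        self.credit = self.weights[self.order[self.index]]

    def dequeue(self):
        # Voice is always served first.
        if self.voice:
            return self.voice.popleft()
        # One extra iteration lets the current queue be revisited with
        # fresh credit when every other queue is empty.
        for _ in range(len(self.order) + 1):
            queue = self.queues[self.order[self.index]]
            if queue and self.credit > 0:
                self.credit -= 1
                return queue.popleft()
            self._advance()
        return None  # all queues are empty


if __name__ == "__main__":
    sched = PortScheduler({"video": 2, "data": 1})
    for cls, pkt in [("video", "v1"), ("video", "v2"), ("video", "v3"),
                     ("data", "d1"), ("data", "d2"), ("voice", "a1")]:
        sched.enqueue(cls, pkt)
    print([sched.dequeue() for _ in range(7)])
```

Because unused credit is never held back, a queue with a large weight does not waste link capacity when it is idle: the scheduler simply moves on to the next non-empty queue.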



4.3 Buffer Management 

To enable high speed processing, the buffers would most likely have to be imple- 
mented as FIFO queues. Therefore, any action to be taken has to be performed 
before the received packet is enqueued. Possible actions are dropping or marking 
the packets, e.g. if explicit congestion notification (ECN) is implemented.

Early dropping/marking schemes such as Random Early Detection (RED)
and its derivatives aim at providing early notification of congestion before the
buffer gets full, and bursty packet loss becomes necessary. Such schemes assume 
that an end-to-end congestion control mechanism, such as the one implemented 
in TCP, will react to the congestion signals. Therefore, they may not be effective 
or appropriate for applications which do not use such mechanisms. Indeed, differ- 
ent buffer management schemes should be used for different classes. Moreover, 
the size of the buffers should be tailored to suit the application types. Thus, 
small buffers would be used for packet-delay sensitive traffic to limit queueing 
delays, while large buffers can be used for applications that are insensitive to 
per-packet delay, but are sensitive to packet loss. 

The priority code of each packet may indicate, in addition to the queue where 
it is to be placed, one of several drop priorities within that queue. This approach 
is used in the DiffServ AF class. Providing several drop priorities within each 
queue allows further differentiation among packets and may be used to achieve 
significant improvements in quality degradation during congestion episodes, as 
discussed in Sec. 5. 
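
To make the idea of several drop priorities within one queue concrete, the following Python sketch applies a RED-like linear drop function with a separate threshold pair per priority. It is a sketch only: the threshold values in `PROFILES` are hypothetical numbers chosen for the example, and the forced drop above the upper threshold is a simplification rather than a description of any particular router implementation.

```python
import random


def drop_probability(avg_queue: float, min_th: float, max_th: float,
                     max_p: float) -> float:
    """RED-like curve: 0 below min_th, linear up to max_p, 1 at or above max_th."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)


# Hypothetical (min_th, max_th, max_p) per drop priority, in packets.
# Low priority packets start being dropped first; high priority packets
# are dropped only when the queue is close to full.
PROFILES = {
    "low": (10, 30, 0.2),
    "medium": (30, 60, 0.1),
    "high": (80, 100, 0.05),
}


def should_drop(priority: str, avg_queue: float, rng=random.random) -> bool:
    """Randomized drop decision for one arriving packet."""
    return rng() < drop_probability(avg_queue, *PROFILES[priority])


if __name__ == "__main__":
    for occupancy in (5, 20, 50, 90):
        print(occupancy,
              {p: round(drop_probability(occupancy, *PROFILES[p]), 3)
               for p in PROFILES})
```

With thresholds ordered in this way, a moderately loaded queue sheds only low priority packets, which is the behavior the drop priorities are meant to provide.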

5 Improving Resilience to Congestion 
for Video and Data Applications 

With traffic types appropriately mapped to different queues in the network, the 
dynamics of each queue depend on the particular traffic it is serving. Adequately 
serviced voice traffic would see little queueing delay and loss in the network, and 
its treatment within the queue does not require further refinement. This is not 
the case for video and data traffic. Since it is not possible to guarantee congestion- 
free delivery to unshaped bursty traffic, performance degradation may occur for 
such traffic. In this section, we show how providing several packet drop priorities 
within one queue can be used to significantly improve the user-perceived quality 
of applications during congestion episodes. 

First, for video applications, we show that layered video in association with 
priority dropping is a simple, yet effective technique for providing graceful degra- 
dation in the event of packet loss. Then, we address data applications, which 
themselves span a wide range of requirements and characteristics. We show how 
identifying and prioritizing different applications can improve user experience. 

5.1 Addressing Packet Loss with Layered Video 

Let us consider the transmission of digital video over packet networks. One 
particular characteristic of video is its high sensitivity to packet loss. Video 
quality is greatly eroded when there is loss of data which contribute heavily 
to quality, such as low frequency DCT coefficients, motion vector information,
or start codes needed for synchronization. In Fig. 6, we illustrate the drastic
quality degradation of a video sequence resulting from random packet loss. The
line segment on the left corresponds to a video sequence encoded with P frames
and B frames; for the line segment on the right, only I frames were used in the
encoding of the same sequence. When the sequence contains P and B frames, 
1% packet loss can lead to poor quality because there is interdependence among 
pictures — the loss of a single packet can have an effect on multiple subsequent 
P and B pictures. We observe in the plot that when this interdependence is 
removed, such that all frames are encoded as I frames, the quality is less affected 
by packet loss. 




(Plot axis: Bandwidth (Mbps).)



Fig. 6. Random packet loss is applied to an MPEG-2 video stream packetized using 
RTP. Video quality degrades sharply. 



There are several possible ways to deal with packet loss. Adaptive encoding 
could be used, in which feedback from the receiver or a network node provides 
the source with information to adapt the transmission rate by modifying en- 
coding parameters. However, this is fairly complex to implement and is limited 
by feedback delay. It is not suitable for networks with high variability, for mul- 
tipoint communications, or for stored video. The technique of smoothing or 
shaping can limit the variability of the stream’s traffic, and the use of admission 
control can limit the variability in the aggregate traffic. This, too, introduces 
complexity in the nodes, and it curtails statistical multiplexing gain, decreasing 
overall throughput. Furthermore, smoothing or shaping introduces delay, clearly 
undesirable when latency is of concern. 

Because loss is inevitable, we must limit its effect by dealing with it
intelligently, protecting important data and dealing with loss where and when it
occurs. This leads us to consider layered video and priority dropping.
Simply put, layered video prioritizes information according to its importance in
contribution to quality. In conjunction with priority dropping, layered video is
a powerful technique for maintaining quality in the presence of loss. We show
that it offers graceful quality degradation rather than the sharp drop we saw in
Fig. 6, and we show how to divide a video stream into layers to maximize the
perceived video quality for a particular range of network conditions.

Video layering using data partitioning. Layering mechanisms define a base
layer and one or more enhancement layers that contribute progressively to the 
quality of the base layer. A base layer can stand alone as a valid encoded stream, 
while enhancement layers cannot. The key observation is that some bits are 
more “important” than others; we identify their importance by placing them 
into different layers, thus allowing a node to drop packets with discrimination. 

The MPEG standards specify four scalable coding techniques for the
prioritization of video data: temporal scalability, data partitioning (DP), SNR 
scalability, and spatial scalability. We consider layering based on data partition- 
ing, treating temporal scalability as a special case of it. One advantage of data 
partitioning is that the overhead incurred by layering is negligible. Another ad- 
vantage is that it is performed after the encoding of the stream, allowing it to 
be easily used with pre-encoded video. 

Data partitioning divides the encoded video bit stream into two or more 
layers by allocating the encoded data to the various layers. Naturally, the data 
with the most impact on perceived quality should be placed in the base layer. 
To indicate the portion of the data which is included in the base layer, we define 
a drop code for each picture type, I, P, and B. The drop code takes on a value 
from 0 to 7, where 0 indicates that all of the data are included in the base layer, 
and 7 indicates that only the header information is placed in the base layer. 
The partitioning of the stream data into the base and enhancement layers is 
completely specified by a drop code triplet, e.g., (036). There is a correspondence 
between the drop code value and a header field defined by the MPEG standards 
that can be used for data partitioning.

Temporal scalability is a special case of data partitioning, where the drop 
code is 007. In temporal scalability, entire B pictures are dropped prior to drop- 
ping any information in I or P pictures. 
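
The drop code triplet notation can be captured in a few lines of Python. The sketch below is illustrative only: the type and function names are hypothetical, and it encodes just what is stated above, namely that each of the I, P, and B codes lies between 0 and 7, that 0 keeps all data in the base layer, that 7 keeps only the header information there, and that temporal scalability corresponds to the triplet 007.

```python
from typing import NamedTuple


class DropTriplet(NamedTuple):
    """Drop codes for I, P, and B pictures, each in the range 0..7."""
    i: int
    p: int
    b: int


def parse_triplet(text: str) -> DropTriplet:
    """Parse a triplet written as three digits, e.g. '036'."""
    if len(text) != 3 or any(ch not in "01234567" for ch in text):
        raise ValueError(f"invalid drop code triplet: {text!r}")
    return DropTriplet(*(int(ch) for ch in text))


def is_temporal_scalability(triplet: DropTriplet) -> bool:
    """True when only B pictures are moved entirely to the enhancement layer."""
    return triplet == DropTriplet(0, 0, 7)


def describe(code: int) -> str:
    """Describe the two end points of the code range given in the text."""
    if code == 0:
        return "all data in base layer"
    if code == 7:
        return "only headers in base layer"
    return "partial data in base layer"


if __name__ == "__main__":
    t = parse_triplet("036")
    print(t, [describe(c) for c in t], is_temporal_scalability(t))
```

For the triplet (036), for instance, I pictures remain entirely in the base layer, while P and B pictures each contribute part of their data to the enhancement layer.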

We illustrate in Fig. 7 the advantage of using 2-layer data partitioning in a
network which supports priority dropping. In this figure, we show the quality
of video using temporal scalability, data partitioning, and no layering. With
temporal scalability, the dropping of B picture data degrades quality. If DC
coefficients and motion vector information from B frames were not dropped, the
decoder could have reconstructed enough of these frames to significantly improve 
the perceived quality. Therefore, not all B frame data should be assigned to 
the enhancement layer. Thus, data partitioning using a well-chosen drop code 
triplet (036) allows for graceful quality degradation as enhancement packets are 
randomly dropped. In the region where no base layer packets need to be dropped 
(above 3.8 Mbps), the quality degradation incurred varies almost linearly with 
the rate of packet loss. However, once we start losing base layer packets, the 
quality falls sharply. This establishes the need to protect the base layer from 
network loss with an appropriate nodal structure. 

We have shown the graph for drop code triplet (036). Other layering
structures (i.e., other triplets) will place the knee at different points while
exhibiting the same two-piece behavior. We have, through simulation, identified
the layering structures which achieve the highest quality for a given bandwidth.
These dominant structures are the same for all sequences studied, and they
correspond to the following triplets: 003, 014, 005, 015, 016, 036, 136, 046, 146,
and 156. In practice, it is desirable to choose from these triplets the proper
layering structure such that the linear portion of the graph is large enough to
just cover the expected bandwidth range delivered by the network.

Fig. 7. Bandwidth-quality tradeoff curves for 2-layer DP, temporal scalability, and no
layering.

Quality degradation can be further improved with 3-layer data partitioning.
In the example shown in Fig. 8, we create three layers by keeping the
enhancement layer the same as in 2-layer DP with drop triplet (036). We then make
the break between the middle layer and the base layer at the point where 2-layer
DP(156) does. Thus, in the 3.8-6.0 Mbps range (only the enhancement layer is
dropped), the behavior parallels that of the 2-layer DP(036). The only difference
is a slight increase in overhead. In the region where the middle layer is being
dropped (2.5-3.8 Mbps), the quality is superior to the same region for DP(156),
because the least important data have been identified and dropped first. As one
would suppose, additional layers do continue to improve video quality, though
with limited incremental improvement beyond 4 layers.

Fig. 8. Bandwidth-quality tradeoff curves for 3-layer DP compared to 2-layer DP
schemes.



Multiplexing layered streams. So far, we have examined a single video
stream with prioritized random loss in the enhancement layers. We now
consider the case where several layered streams share a limited resource. We have
observed graceful quality degradation as the number of streams is increased, as
shown in Fig. 9. We have also studied SNR scalability but do not make
further comment here, except to remark that it performs similarly to DP, but the
latter is preferred because of its negligible overhead and ease of implementation.




Fig. 9. Multiple video streams sharing a 50 KB buffer, for three schemes: SNR scala- 
bility, DP, and no layering. 



We conclude that the combination of a simple layering of video data and a 
simple priority dropping mechanism at network nodes, appropriately employed, 
can have a significant effect on sustaining video quality in the face of packet loss. 



5.2 TCP Applications 

Until now, all data applications have used best-effort service for lack of an
alternative. However, most Internet users have experienced times where severe
quality loss is suffered. Such degradation is most distinctly perceived when
associated with interactive applications: telnet interactivity is severely
hindered, and web page download times become excessive during congested hours.
This is particularly unacceptable for business applications which require
predictable service. While large page download times can be attributed in part to
server overload, we focus here on the delays caused by the interaction of TCP's
congestion control mechanisms with packet loss in the network.

We observe that, similar to layered video, some data packets contribute more 
to user perceived quality than others. With large window sizes, the TCP fast re- 
transmit mechanism can be used to minimize the impact of packet loss. However, 
when the congestion window is small, there is a much longer delay to recover 
from a lost packet. 

Here we illustrate, using simulation results, the benefits that can be achieved 
for interactive applications when their packets are appropriately prioritized in 
the network. We consider that all data applications share one queue. Packets 
are marked at the source with one of 3 drop priorities. For FTP and HTTP, 
the SYN packet and the packets sent when the connection is operating with a 
small congestion window are marked as high priority because of the penalty in 
recovering from their loss. For telnet, all packets are marked with high priority. 
The aggregate high and medium priority traffic generated by each station is 
shaped to conform to two token bucket profiles, the goal of which is to limit the 
amount of high and medium priority packets injected by each user. The access 
to high priority tokens at the source is prioritized based on the application, with 
telnet receiving the highest access priority, and FTP the lowest. 
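
The marking and shaping rules above can be sketched as follows. This is a minimal illustration, not the simulation code behind the results: the `TokenBucket` class, the `small_cwnd` threshold of 4 segments, and the choice that remaining FTP/HTTP packets request medium priority are all assumptions made for the example. The application-based ordering of access to high priority tokens is omitted.

```python
from dataclasses import dataclass

HIGH, MEDIUM, LOW = 0, 1, 2  # drop priorities; HIGH is dropped last


@dataclass
class TokenBucket:
    rate: float          # tokens (bytes) added per second
    depth: float         # maximum number of stored tokens (bytes)
    tokens: float = 0.0  # current fill level
    last: float = 0.0    # time of the last refill, in seconds

    def take(self, size: int, now: float) -> bool:
        """Refill for the elapsed time, then try to spend `size` tokens."""
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= size:
            self.tokens -= size
            return True
        return False


def wanted_priority(app: str, is_syn: bool, cwnd_segments: int,
                    small_cwnd: int = 4) -> int:
    """Priority a packet asks for before shaping (small_cwnd is assumed)."""
    if app == "telnet":
        return HIGH  # all telnet packets ask for high priority
    if is_syn or cwnd_segments <= small_cwnd:
        return HIGH  # losses here are expensive to recover from
    return MEDIUM    # assumption: other FTP/HTTP packets request medium


def mark(size: int, wanted: int, high: TokenBucket, medium: TokenBucket,
         now: float) -> int:
    """Demote a packet one level at a time when a bucket has no tokens."""
    if wanted == HIGH and high.take(size, now):
        return HIGH
    if wanted <= MEDIUM and medium.take(size, now):
        return MEDIUM
    return LOW
```

A packet that finds no high priority tokens falls back to the medium bucket, and then to low priority, so the per-station volume of protected traffic stays bounded by the two profiles.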

The topology used for the simulations consists of 800 user stations, organized 
into 400 source-destination pairs of different round trip times (ranging from 20 
msec to 200 msec), and connected by a symmetric tree with hierarchical link 
speeds (starting at 1.5 Mbps for user links, with a bottleneck of 100 Mbps). The 
router buffers are appropriately sized to provide low delay for telnet, while giving 
good performance to HTTP and FTP traffic. A randomized dropping function 
similar to RED is used for dropping packets for each of the three priorities. As the 
queue size increases, low priority packets are dropped first, followed by medium 
priority packets. High priority packets are dropped only when the queue size gets close 
to the maximum buffer size. Traffic consists of 1 telnet and 1 web connection 
per source-destination pair, with background traffic of repetitive short FTP file 
transfers (200 KB) in both directions. 
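
A rough sketch of a RED-like drop function with three priorities is given below. The thresholds and the maximum drop probability in `PROFILES` are illustrative assumptions, not the values used in the simulations. For brevity the sketch uses the instantaneous queue occupancy, whereas classic RED works on an averaged queue size.

```python
import random

HIGH, MEDIUM, LOW = 0, 1, 2

# (min_th, max_th, max_p) per priority; thresholds are fractions of the buffer.
# All values are assumptions chosen only to show the ordering of the drops.
PROFILES = {
    LOW: (0.25, 0.50, 0.1),
    MEDIUM: (0.50, 0.75, 0.1),
    HIGH: (0.90, 1.00, 0.1),
}


def drop_probability(occupancy, min_th, max_th, max_p):
    """Linear ramp from 0 to max_p between the thresholds, then certain drop."""
    if occupancy < min_th:
        return 0.0
    if occupancy >= max_th:
        return 1.0
    return max_p * (occupancy - min_th) / (max_th - min_th)


def should_drop(priority, queue_len, buffer_size, rng=random.random):
    """Randomized drop decision for one arriving packet."""
    min_th, max_th, max_p = PROFILES[priority]
    p = drop_probability(queue_len / buffer_size, min_th, max_th, max_p)
    return rng() < p
```

With these profiles a queue that is 60% full always drops low priority arrivals, drops medium priority arrivals with a small probability, and never drops high priority arrivals.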

In the following, we illustrate how the performance of interactive applica- 
tions can be improved by appropriate service differentiation, at a modest cost to 
non-interactive applications. In Fig. 10 we plot the complementary cumulative 
distribution function of web download time^ for different service differentiation 
scenarios. The curves marked DT and RED correspond to scenarios without 
service differentiation (best effort), with queues managed using Drop Tail and 
RED, respectively. As can be seen from the plot, 10% of the page downloads for 
both Drop Tail and RED suffer a delay in excess of 19 seconds. 
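
For readers unfamiliar with the plot type, the following sketch shows how an empirical complementary cumulative distribution function can be computed from a list of measured download times. The function names are hypothetical, and no data from the simulations is reproduced here.

```python
from bisect import bisect_right


def ccdf(samples):
    """Return (x, P[X > x]) for each distinct sample value x."""
    xs = sorted(samples)
    n = len(xs)
    return [(x, 1.0 - bisect_right(xs, x) / n) for x in sorted(set(xs))]


def fraction_exceeding(samples, threshold):
    """Fraction of samples strictly larger than the threshold."""
    return sum(1 for s in samples if s > threshold) / len(samples)
```

A statement such as "10% of the downloads exceed 19 seconds" corresponds to `fraction_exceeding(times, 19)` returning 0.1, that is, the point where the CCDF curve crosses 0.1.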



^ We show results for HTTP/1.0 traffic here, for a web page with eight 10 KB embedded 
images. Up to 4 connections are opened in parallel to download the page components. 
Similar results were obtained for HTTP/1.1. 



Fig. 10. The complementary cumulative distribution function of web page download 
times with (DS and APPL) and without (DT and RED) service differentiation. 



In contrast, we show the results with service differentiation, where packets 
are marked at the source based on the application and TCP connection state. 
The corresponding curve, marked DS, clearly shows a significant improvement 
in terms of download times, where all downloads take between 3 and 6 seconds. 

In Fig. 11 we plot the CCDF of short file transfer times, which shows that the 
improvement in page download times was obtained at little cost to the back- 
ground traffic. A simpler form of differentiation would be to base the packet 
drop priority marking solely on the generating application type. Thus, packets 
belonging to telnet and similar low-bandwidth and delay sensitive applications 
such as Internet gaming would be marked high priority, and those belonging to 
short web page downloads would be marked medium priority. Packets gener- 
ated by other, non-interactive applications such as FTP, would be marked low 
priority. The aggregate traffic of each priority is again shaped to limit its rate. 
The curves marked APPL in Figures 10 and 11 show that this technique can 
improve the performance of web page downloads, but rather less effectively than 
the more intricate method (DS) and at a higher cost in performance loss to the 
background traffic. Similar results can be shown for telnet in this scenario, where 
appropriate differentiation is provided through marking all of its packets at high 
priority. This allows the elimination of excessive delay of character echoes (1 sec), 
which result from retransmits due to packet loss. Without service differentiation, 
these delays occur for about one out of ten typed characters (more details can 
be found in [2Uj ) . In conclusion, it is possible to use multiple drop priorities to 
the advantage of interactive applications, in order to improve the user-perceived 
performance of such applications. 



Fig. 11. The complementary cumulative distribution function of file transfer times 
with (DS and APPL) and without (DT and RED) service differentiation. 



6 Conclusion 

For the Internet to support multimedia applications, service differentiation is 
needed. In this paper, we describe the characteristics and requirements of voice, 
video, and interactive data applications, and we demonstrate the performance 
improvements achieved by providing different treatments for each of these three 
types of traffic. We propose a three queue model for network nodes with one 
queue for voice traffic, one queue for video traffic, and one queue for data traffic. 
The voice queue is served with strict priority, and the video and data queues 
share the remaining capacity using weighted round robin scheduling. We also 
show that the performance of the video and data queues may be further improved 
by using multiple levels of drop precedence within each queue and by marking 
packets according to their importance. 
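
The three queue model can be sketched as a small scheduler. This is a packet-based approximation under assumed weights (2 for video, 1 for data); the class name and the weights are hypothetical and the sketch ignores packet sizes.

```python
from collections import deque


class ThreeQueueScheduler:
    """Strict priority for voice; weighted round robin for video and data."""

    def __init__(self, video_weight=2, data_weight=1):
        self.queues = {"voice": deque(), "video": deque(), "data": deque()}
        # One WRR round: each class gets as many slots as its weight.
        self.round = ["video"] * video_weight + ["data"] * data_weight
        self.slots = []  # slots still unused in the current round

    def enqueue(self, traffic_class, packet):
        self.queues[traffic_class].append(packet)

    def dequeue(self):
        # Voice is always served first.
        if self.queues["voice"]:
            return self.queues["voice"].popleft()
        # Finish the current round, then try one fresh round if needed.
        for _ in range(2):
            while self.slots:
                cls = self.slots.pop(0)
                if self.queues[cls]:
                    return self.queues[cls].popleft()
            self.slots = list(self.round)
        return None  # all queues are empty
```

Slots belonging to an empty queue are skipped, so the scheduler stays work-conserving: unused video capacity is available to data and vice versa.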



Abstract. Nearly all of the multimedia streaming applications running on the 
Internet today are basically configured or designed for 2D video broadcast or 
multicast purposes. With its tremendous flexibility, MPEG-4 interactive client- 
server applications are expected to play an important role in online multimedia 
services in the future. This paper presents the initial design and implementation 
of a transport infrastructure for an IP based network that will support a client- 
server system which enables end users to: 1) author their own MPEG-4 
presentations, 2) control the delivery of the presentation, and 3) interact with 
the system to make changes to the presentation in real time. A specification for 
the overall system structure is outlined. Some initial thoughts on the server and 
client system designs, the data transport component, QoS provisioning, and the 
control plane necessary to support an interactive application are described. 



1 Introduction 

Today, most of the multimedia services consist of a single audio or natural (as 
opposed to synthetic) 2D video stream. MPEG-4, a newly released ISO/IEC standard, 
provides a broad framework for the joint description, compression, storage, and 
transmission of natural and synthetic audio-visual data. It defines improved 
compression algorithms for audio and video signals, and efficient object-based 
representation of audio-video scenes [1]. It is foreseen that MPEG-4 will be an 
important component of multimedia applications on IP-based networks in the near 
future [4]. 

In MPEG-4, audio-visual objects are encoded separately into their own Elementary 
Streams (ES). In addition, the Scene Description (SD), also referred to as the Binary 
Format for Scenes (BIFS), defines the spatio-temporal features of these objects in the 
final scene to be presented to the end user. Based upon VRML (Virtual Reality 
Modeling Language), the SD uses a tree-based graph, and can be dynamically 
updated. The SD is conveyed between the source and the destination through one or 
more ESs and is transmitted separately. Object Descriptors (ODs) are used to 
associate scene description components to the actual elementary streams that contain 
the corresponding coded media data. Each OD groups all descriptive components that 
are related to a single media object, e.g., an audio or video object, or even an 
animated face. ODs carry information on the hierarchical relationships, locations and 
properties of ESs. ODs themselves are also transported in ESs. The separate transport 
of media objects, SD and ODs enables flexible user interactivity and content 
management. 

The MPEG-4 standard defines a three-layer structure for an MPEG-4 terminal: the 
Compression Layer, the Sync Layer and the Delivery layer. The Compression Layer 
processes individual audio-visual media streams and organizes them in Access Units 
(AU), the smallest elements that can be attributed individual timestamps. The 
compression layer can be made to react to the characteristics of a particular delivery 
layer such as the path-MTU or loss characteristics. The Sync Layer (SL) primarily 
provides the synchronization between streams. AUs are here encapsulated in SL 
packets. If an AU is larger than an SL packet, it is fragmented across 
multiple SL packets. The SL produces an SL-packetized stream, i.e., a sequence of 
SL-packets. The SL-packet headers contain timing, sequencing and other information 
necessary to provide synchronization at the remote end. The packetized streams are 
then sent to the Delivery Layer. 
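
The fragmentation of an AU into SL packets can be illustrated with a simplified sketch. The `SLPacket` fields below are a hypothetical reduction of the SL header to a sequence number, start/end flags and a decoding timestamp; the real header syntax is configurable and is not reproduced bit-exactly here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SLPacket:
    seq: int             # packet sequence number
    au_start: bool       # True on the first fragment of an AU
    au_end: bool         # True on the last fragment of an AU
    dts: Optional[int]   # decoding timestamp, carried on the first fragment only
    payload: bytes


def packetize_au(au: bytes, dts: int, max_payload: int,
                 first_seq: int = 0) -> List[SLPacket]:
    """Split one Access Unit into SL packets of at most max_payload bytes."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    chunks = [au[i:i + max_payload] for i in range(0, len(au), max_payload)] or [b""]
    last = len(chunks) - 1
    return [
        SLPacket(seq=first_seq + i, au_start=(i == 0), au_end=(i == last),
                 dts=dts if i == 0 else None, payload=chunk)
        for i, chunk in enumerate(chunks)
    ]
```

The receiver can reassemble the AU by concatenating payloads from the packet with `au_start` set up to the packet with `au_end` set, using `seq` to detect losses.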

In the MPEG-4 standard, a delivery framework referred to as the Delivery 
Multimedia Integration Framework (DMIF) is specified at the interface between the 
MPEG-4 synchronization layer and the network layer. DMIF provides an abstraction 
between the core MPEG-4 system components and the retrieval methods [2]. Two 
levels of primitives are defined in DMIF. One is for communication between the 
application and the delivery layer to handle all the data and control flows. The other 
one is used to handle all the message flows in the control plane between DMIF peers. 
Mapping these primitives to the available protocol stacks in an IP-based network is 
still an on-going research issue. Moreover, designing an interactive client-server 
system for deployment in the Internet world using the recommended primitives is 
even more challenging [3]. 

The object-oriented nature of MPEG-4 makes it possible for an end user to 
manipulate the media objects and create a multimedia presentation tailored to his or 
her specific needs, end device and connection limitations. The multimedia content 
resides on a server (or bank of servers) and the end user has either local or remote 
access to the service. This service model differs from the traditional streaming 
applications because of its emphasis on end user interactivity. To date, nearly all the 
streaming multimedia applications that are running on the Internet are basically 
designed for simple remote retrieval or for broadcasting/multicasting services. 
Interactive services are useful in settings such as distance learning, gaming, etc. The 
end user can pick the desired language, the quality of the video, the format of the text, 
etc. 

There are only a few MPEG-4 interactive client-server systems discussed in the 
literature. H. Kalva et al. describe an implementation of a streaming client-server 
system based on an IP QoS Model called XRM [6]. As such it cannot be extended 
for use over a generic IP network (it requires a specific broadband kernel - xbind). Y. 
Pourmohammadi et al. propose a DMIF-based system design for IP-based networks. 
However, their system is mainly geared toward remote retrieval and very little is 
mentioned in the paper regarding client interactivity with respect to the actual 
presentation playback [7]. In [6], the authors present the Command Descriptor 
Framework (CDF), which provides a means to associate commands with media 
objects in the SD. The CDF has been adopted by the MPEG-4 Systems Group, and is 
part of the version 2 specification. It provides the features needed to support interactivity 
in MPEG-4 systems [8-10]. Our transport infrastructure assumes that command 
descriptors (CDs) are used for objects in the SD. This will enable end users to interact 
with individual objects or groups of objects and control them. 

In order to support an MPEG-4 system that enables end user interactivity as 
described above, many issues must be considered. To list a few: 1) transmission of 
end user interactivity commands to the server, 2) transport of media content with QoS 
provisioning, 3) real time session control based upon end user interactivity, 4) 
mapping of data streams onto Internet transport protocols, 5) extension of existing 
IETF signaling and control protocols to support multiple media streams in an 
interactive environment, etc. 

In this paper, we present our initial ideas on the design of a client-server transport 
architecture which will enable end users to create their own MPEG-4 presentation, 
control the delivery of the content over an IP-based network, and interact with the 
system to make changes to the presentation in real time. Section 2 presents the 
structure of the overall system. Server and client designs are described in section 3. 
In section 4, data transport, QoS provisioning and control plane message exchange are 
discussed. In section 5, we conclude and discuss the future work. 



2 Overall System Architecture 

The system we are proposing to develop is depicted in Figure 1 and consists of 1) an 
MPEG-4 server, which stores encoded multimedia objects and produces MPEG-4 
content streams, 2) an MPEG-4 client, which serves as the platform for the 
composition of an MPEG-4 presentation as requested by the end user, and 3) an IP 
network that will transport all the data between the server and the client. 

The essence of MPEG-4 lies in its object-oriented structure. As such, each object 
forms an independent entity that may or may not be linked to other objects, spatially 
and temporally. The SD, the ODs, the media objects, and the CDs are transmitted to 
the client through separate streams. This approach gives the end user at the client side 
tremendous flexibility to interact with the multimedia presentation and manipulate the 
different media objects. End users can change the spatio-temporal relationships 
among media objects, turn on or shut down media objects, or even specify different 
perceptual quality requirements for different media objects dependent upon the 
associated command descriptors for each object or group of objects. This results in a 
much more difficult and complicated session management and control architecture 
[6]. Our design starts out with the premise that end user interactivity is crucial to the 
service and therefore it targets a flexible session management scheme with efficient 
and adaptive encapsulation of data for QoS provisioning. 

User interactivity can be defined to consist of three degrees or levels of 
interactivity that correspond to what type of control is desired: 

1. Presentation level interactivity: in which a user actually makes changes to the 
scene by controlling an individual object or group of objects. This also includes 
presentation creation. 

2. Session level interactivity: in which a user controls the playback process of the 
presentation (i.e., VCR like functionality for the whole session). 

3. Local level interactivity: in which a user only makes changes that can be taken 
care of locally, e.g., changing the position of an object on the screen, volume 
control, etc. 

Throughout our discussion below we will be making references to these three 
levels as each results in a different type of control message exchange in our system. 




Fig. 1. System Architecture (server side and client side) 

We assume that the server maintains a database or a list of available MPEG-4 
content and provides WWW access to it. An end user at a remote client side retrieves 
information regarding the media objects that he/she is interested in, and composes a 
presentation based upon what is available and desired. The system operation, after the 
end user has completed the composition of the presentation, can be summarized as 
follows: 

1. The client requests a service by submitting the description of the presentation to 
the Data Controller (DC) at the server side. 

2. The DC on the server side controls the Encoder/Producer module to generate the 
corresponding SD, ODs, CDs and other media streams based upon the presentation 
description information submitted by the end user at the client side. The DC then 
triggers the Session Controller (SC) on the server side to initiate a session. 

3. The SC on the server side is responsible for session initiation, control and 
termination. It passes along the stream information that it obtained from the DC to 
the QoS Controller (QC), which manages, in conjunction with the Packer, the creation 
of the corresponding transport channels with the appropriate QoS provisions. 

4. The Messenger Module (MM) on the server side, which handles the 
communication of control and signaling data, then signals to the client the 
initiation of the session and network resource allocation. The encapsulation 
formats and other information generated by the Packer when processing the 
“packing” of the SL-packetized streams are also signaled to the client to enable it 
to unpack the data. 

5. The actual stream delivery commences after the client indicates that it is ready to 
receive, and streams flow from the server to the client. After the decoding and 
composition procedures, the MPEG-4 presentation authored by the end user is 
rendered on his or her display. 

In the next sections, we describe in some detail the functionality of the different 
modules and how they interact. 



3 Client-Server Model 

3.1 The MPEG-4 Server 

Upon receiving a new service request from a client, the MPEG-4 server starts a thread 
for the client, and walks through the steps described in the previous section to set up a 
session with the client. The server maintains a list of sessions established with clients 
and a list of associated transport channels and their QoS characteristics. 

Figure 2 shows the components of the MPEG-4 Server. The Encoder/Producer 
compresses raw video sources in real-time or reads out MPEG-4 content stored in 
MP4 files. The elementary streams produced by the Encoder/Producer are packetized 
by the SL-Packetizer. The SL-Packetizer adds SL-Packet headers to the AUs in the 
elementary streams to achieve intra-object stream synchronization. The headers 
contain information such as decoding and composition time stamps, clock references, 
padding indication, etc. The whole process is scheduled and controlled by the DC. 

The DC is responsible for several functions: 

1. It responds to control messages that it gets from the client side DC. These 
messages include the description of the presentation composed by the user at the 
client side and the presentation level control commands issued by the remote client 
DC resulting from user interactions. 

2. It communicates with the SC to initiate a session. It also sends SC the session 
update information as it receives user interactivity commands and makes the 
appropriate SD and OD changes. 

3. It controls the Encoder/Producer and SL-Packetizer to generate and packetize the 
content as requested by the client. 

4. It schedules audio-visual objects under resource constraints. With reference to the 
System Decoding Model, the AUs must arrive at the client terminal before their 
decoding time [1]. Efficient scheduling must be applied to meet this timing 
requirement and also satisfy the delay tolerances and delivery priorities of the 
different objects. 

The SC likewise is responsible for several functions: 

1. When triggered by the DC for session initiation, it will coordinate with the QC to 
set up and maintain the numerous transport channels associated with the SL 
packetized streams. 

2. It maintains session state information and updates this whenever it receives 
changes from the DC resulting from user interactivity. 

3. It responds to control messages sent to it by the client side SC. These messages 
include the VCR type commands that the user can use to control the session. 



Fig. 2. Structure of the MPEG-4 Server 



3.2 The MPEG-4 Client 

The architectural design of the MPEG-4 client is based upon the MPEG-4 System 
Decoder Model (SDM), which is defined to achieve media synchronization, buffer 
management, and timing, when reconstructing the compressed media data [1]. Figure 
3 illustrates the components of the MPEG-4 client. 

The SL Manager is responsible for binding the received ESs to decoding buffers. 
The SL-Depacketizer extracts the ESs received from the Unpacker and passes them to 
the associated decoding buffers. The corresponding decoders then decode the data in 
the decoding buffers and produce Composition Units (CUs), which are then put into 
composition memories to be processed by the compositor. The User Event Handler 
module handles the user interactivity. It filters the user interactivity commands and 
passes the messages along to the DC and the SC for processing. 

The DC at the client side has the following responsibilities: 

1. It controls the decoding and composition process. It collects all the necessary 
information, e.g., the size of the decoding buffers which is specified in decoder 
configuration descriptors and signaled to the client via the OD, the appropriate 
decoding time and composition time which is indicated in the SL packet header, 
etc., for the decoding process. 

2. It manages the flow of control and data information, controls the creation of 
buffers and associates them with the corresponding decoders. 

3. It relays user presentation level interactivity to the server side DC and processes 
both session level and local level interactivity to manage the data flows on the 
client terminal. 



Fig. 3. Structure of the MPEG-4 Client (legend: control flow, data flow) 

The SC at the client side communicates with the SC at the server side exchanging 
session status information and session control data. The User Event Handler will 
trigger the SC when session level interactivity is detected. The SC then translates the 
user action into the appropriate session control command. 
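
The way the User Event Handler could sort user actions into the three interactivity levels can be sketched as below. The command names and the rule that a VCR-like command without a target object is session level are assumptions made for illustration; the paper does not define a concrete command set.

```python
from enum import Enum


class Level(Enum):
    PRESENTATION = "presentation"
    SESSION = "session"
    LOCAL = "local"


# Hypothetical command names, used only for this sketch.
VCR_COMMANDS = {"stop", "pause", "resume", "fast_forward"}
LOCAL_COMMANDS = {"move_object", "set_volume"}


def classify(command, target_object=None):
    """Map a user action to an interactivity level."""
    if command in LOCAL_COMMANDS:
        return Level.LOCAL
    if command in VCR_COMMANDS and target_object is None:
        return Level.SESSION  # VCR-like control of the whole session
    return Level.PRESENTATION  # anything that changes objects in the scene


def route(level):
    """Name the module pair that handles each level."""
    return {
        Level.PRESENTATION: "client DC -> server DC",
        Level.SESSION: "client SC -> server SC",
        Level.LOCAL: "client DC (handled locally)",
    }[level]
```

For example, `classify("pause")` is treated as session level, while `classify("pause", "object_3")` is presentation level because it targets a single object.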



4 Transport Architecture 

The efficient and adaptive encapsulation of MPEG-4 content with regard to the 
MPEG-4 system specification is still an open issue. There is a lot of ongoing research 
on how to deliver MPEG-4 content over IP-based networks. Figure 4 shows our 
proposed design for the delivery layer. The following sections detail the components 
in this design. 



4.1 Transport of MPEG-4 SL-Packetized Streams 

Considering that the MPEG-4 SL defines some transport related functions such as 
timing and sequence numbering, encapsulating SL packets directly into UDP packets 
seems to be the most straightforward choice for delivering MPEG-4 data over 
IP-based networks. Y. Pourmohammadi et al. presented a system adopting this approach 
[7]. However, some problems arise with such a solution. Firstly, it is hard to 
synchronize MPEG-4 streams from different servers in the variable delay 
environment which is common in the Internet. Secondly, no other multimedia data 
stream can be synchronized with MPEG-4 data carried directly over UDP in one 
application. Finally, such a system lacks a reverse channel for carrying feedback 
information from the client to the server with respect to the quality of the session. 
This is a critical point if QoS monitoring is desired. 




Fig. 4. Structure of the Delivery Layer (Packer and Unpacker; legend: control flow, data flow) 

Another option is to deliver the SL packets over RTP, a standard protocol 
providing end-to-end transport functions for real-time Internet applications [11]. RTP 
has associated with it a control protocol, RTCP, which provides a feedback channel for 
quality monitoring. In addition, the synchronization problems incurred when using 
UDP directly can be solved by exploiting the timing information contained in the 
RTCP reports. The problem arising when using RTP is the need to remove the 
resulting redundancy, as the RTP header duplicates some of the information provided 
in SL packet header. This adds to the complexity of the system and increases the 
transport overhead [12]. 
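
To make the redundancy concrete, the sketch below builds the 12-byte fixed RTP header, whose sequence number and timestamp fields overlap with information also carried in the SL packet header. The payload type 96 is an assumed value from the dynamic range, and the function names are hypothetical.

```python
import struct


def pack_rtp_header(seq, timestamp, ssrc, payload_type=96, marker=False):
    """Build the 12-byte fixed RTP header (version 2, no padding, no CSRCs)."""
    byte0 = 2 << 6  # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1,
                       seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def unpack_rtp_header(header):
    """Parse the fixed header fields back into a dictionary."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", header[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }
```

Since the SL packet header can also carry sequencing and timing fields, a naive encapsulation sends this information twice per packet.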

There are a number of Internet Drafts describing RTP packetization schemes for 
MPEG-4 data [12-14]. An approach proposed by Avaro et al. is attractive due to its 
solution regarding the duplication problem of the SL packet header and its 
independence of the MPEG-4 system, i.e., it is not DMIF-based [12]. The redundant 
information in the SL packet header is mapped into the RTP header and the remaining 
part, which is called the “reduced SL header”, is placed in the RTP payload along with 
the SL packet payload. Detailed format information is signaled to the receiver via 
SDP. The MPEG-4 system also defines Flexible Multiplexing (FlexMux) to bundle 
several ESs with similar QoS requirements. The FlexMux scheme can be optionally 
used to simplify session management by reducing the number of RTP sessions needed 
for an MPEG-4 multimedia session. Van der Meer et al. proposed a scheme that 
provides an RTP payload format for MPEG-4 FlexMux streams [13]. 

Figure 5 shows the detailed layering structure inside the Packer and Unpacker. In 
the Packer, the SL-packetized streams are optionally multiplexed into FlexMux 
streams at the FlexMux layer, or directly passed to the transport protocol stacks 
composed of RTP, UDP and IP. The resulting IP packets are transported over the 
Internet. In the Unpacker, the data streams are processed in the reverse manner before 
they are passed to SL-Depacketizer. 



Fig. 5. Layering structure of Packer and Unpacker 

In the Packer, the multiplexing of the SL-packetized streams is processed 
according to the QoS requirements of the streams and as such managed by the QoS 
controller. At the RTP Layer, the format of the encapsulation is passed to the 
signaling agent in SDP format to notify the client. The Packer is responsible for the 
allocation of transport channels, which are defined by port numbers, with the 
information from the QoS controller. At the IP Layer, certain actions managed by the 
QoS controller will be passed onto the network layer to meet the IP QoS 
requirements. 



4.2 QoS Provisioning 

According to the MPEG-4 system standard, each stream has an associated “transport” 
QoS description for its delivery [1]. This QoS description consists of a number of 
QoS metrics such as delay, loss rate, stream priority, etc. It is then up to the 
implementation of the delivery layer to fulfill these requirements. 

In our system, the QoS Controller at the server side takes all the QoS issues into 
consideration as depicted in Figure 4. It is mainly responsible for managing the 
transport channel setup according to the required QoS parameters and controlling the 
traffic conditioning of IP packets. We model the IP network that our system is built 
on as a Differentiated Services (DiffServ) network, which is also the model for 
Internet2. The QoS Controller maps the values of the QoS parameters to the values of 
the DiffServ Code Point (DSCP), which in turn determines the per-hop forwarding 
behavior of the IP packets. It then rewrites the DS byte in the IP header. It also 
controls the policing and rate shaping that takes place at the IP layer of the Packer. 
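
A possible shape for the QoS-to-DSCP mapping is sketched below. The codepoint values are the standard ones for Expedited Forwarding and two Assured Forwarding classes, but the delay and loss thresholds in `choose_dscp` are assumptions picked only to make the example concrete; the mapping policy itself is not specified in this paper.

```python
# Standard DSCP values (6-bit codepoints).
EF, AF41, AF21, BEST_EFFORT = 46, 34, 18, 0


def choose_dscp(max_delay_ms, max_loss_rate):
    """Pick a codepoint from stream QoS metrics (thresholds are assumed)."""
    if max_delay_ms <= 50:
        return EF           # tight delay bound: expedited forwarding
    if max_delay_ms <= 200:
        return AF41         # interactive media
    if max_loss_rate <= 0.01:
        return AF21         # loss-sensitive but delay-tolerant streams
    return BEST_EFFORT


def ds_byte(dscp):
    """Place the 6-bit DSCP in the upper bits of the DS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits")
    return dscp << 2
```

On platforms that expose the `IP_TOS` socket option, the resulting byte could be applied to a UDP socket with `sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, ds_byte(dscp))`; whether the network honors the marking depends on the DiffServ domain's policy.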



4.3 Exchange of Signaling, Session Control and Presentation Control Messages 

There are mainly three kinds of message flows exchanged between the server and the 
client in our system. Signaling messages are used mainly to locate the client, establish 
a network session, specify media information, modify, and tear-down an existing 
network session. Session Control messages contain commands issued by both the 
server and client to manage the session and provide real time session control to reflect 
user interactivity. Presentation Control messages are used to relay the presentation 
composed by the user and the presentation level user interactivity. As shown in 
Figure 4, three channels are maintained to carry these three distinct message flows 
between the server and the client. 

SIP is a signaling protocol for creating, modifying and terminating sessions with 
one or more participants [4]. Because of its versatility, SIP fits in well with our 
system model. We will use SDP to deliver information such as media stream 
encapsulation format, network resource allocation indication, etc. The SDP messages 
will be embedded in the SIP message payload. SIP User Agents are placed in the MM 
in our system for session initiation, termination and transport of SDP messages. 

RTSP is a control protocol used for real-time streaming applications [16]. Some 
papers have proposed schemes for adopting RTSP to provide some basic control and 
simple signaling for MPEG-4 applications [5]. However, as RTSP was primarily 
designed for media-on-demand scenarios, it cannot support the sophisticated 
interactivity required by our system. To reduce the overall system complexity, we 
have separated the signaling and control functions, and will design a Session Control 
Protocol (SCP) solely for exchanging control messages to manage the session (e.g., 
stop, resume, pause, etc.) in real time. Like SIP, the SCP agent will be placed in the 
MM to handle the message flow. 

In our design, presentation control messages are exchanged between the client and 
the server via the Presentation Control Protocol (PCP). It will be designed specifically 
to support the presentation level user interactivity. During the presentation playback, 
the end user is able to interact with, and control what is being displayed at the object 
level. For example, VCR like functionality, such as stop, pause, resume, fast forward, 
can be associated with each object or group of objects. These controls will be 
implemented initially. More complex controls, such as the texture, dimensions, 
quality, etc., of an object, will be implemented as the design of the system matures 
and more detailed CDs are created for the objects. Similar to the other two agents, we 
will incorporate the PCP agent in the MM to communicate. 



5 Conclusions 

We presented in this paper a design for a transport infrastructure to support interactive 
multimedia presentations which enable end users to choose available MPEG-4 media 
content to compose their own presentation, control the delivery of such media data 
and interact with the server to modify the presentation in real-time. Detailed structures 
of the server and client were described. We discussed the issues associated with the 
delivery of media data and information exchange using a dedicated control plane that 
supports the exchange of signaling, session control and presentation control messages. 

In summary, our work is focused on the following issues: 

1. Translation of user interactivity into appropriate control commands 

2. Development of a presentation control protocol to transmit user interactivity 
commands and a session control protocol to enable real-time session management 
and information exchange to support user interactions 

3. Processing of the presentation control messages by the server to update the 
MPEG-4 session 

4. Transport of multiple inter-related media streams with IP-based QoS provisioning 

5. Client-side buffer management for decoding the multiple media streams and scene 
composition 



Abstract. In the last few years researchers have made great efforts to design 
TCP-friendly UDP-based congestion control procedures to be applied to 
real-time applications; however, very little effort has been devoted to investigating 
how real-time sources can be designed under TCP-friendly bandwidth 
constraints. In this perspective, this paper investigates the effects of the 
bandwidth variation rate during bandwidth profile variations on both the error 
of the rate controller in fitting the bandwidth profile, and the distortion 
introduced by the quantization mechanism of the MPEG video encoder. To this 
end we introduce an SBBP/SBBP/1/K queueing system modeling an MPEG 
video source where a feedback law is used to provide rate adaptation. The 
proposed paradigm addresses any feedback law, provided that the parameter to 
be varied is the quantizer scale. 



1 Introduction 

Distributed multimedia applications usually employ the UDP protocol to transmit 
video streams, because the delays added by relying on TCP retransmission 
mechanisms are unacceptable in real-time video transmission. As these applications 
are spreading, it is becoming increasingly important to ensure that they are able to co- 
exist with current TCP-based applications. In particular, as UDP does not embed any 
congestion control mechanism, video sessions are unresponsive sessions, that is, they 
do not back off their rates in times of congestion as TCP does. This behavior is 
unacceptable in two ways: first, coarsely fair sharing of network resources is no 
longer maintained as TCP sessions obviously suffer from competing with video 
sessions; secondly, as more and more “greedy” connections are set up across the 
Internet, the goodput of the network will decrease because unresponsive sessions 
typically send data packets at full rate even if their packets are later dropped inside the 
network. 

It is therefore envisioned to enhance these UDP-based video communications with 
some kind of congestion control, in order to make them behave like “good network 
citizens” at times of bandwidth scarcity [1]. Achievement of the above scenario 
involves two challenging tasks. The first concerns the design of TCP-friendly 



S. Palazzo (Ed.): IWDC 2001, LNCS 2170, pp. 413-432, 2001. 
© Springer-Verlag Berlin Heidelberg 2001 




414 A. Cemuto, A. Lombardo, and G. Schembra 



congestion control procedures for UDP-based sources: they have to be designed to 
provide a video connection with the same share of bandwidth as a TCP connection 
when they share a bottleneck over the same network path. The second concerns the 
definition of rate-adaptation mechanisms in the video sources which are able to shape 
the offered throughput according to the bandwidth profile allowed by the congestion 
control procedure. With respect to this task, the MPEG encoding standard is one of 
the most promising techniques in video encoding, thanks to its high compression ratio 
and flexibility. Let us note that, although achievement of the first task alone saves the 
network from unfairness or monopolization risks, it would not result in meaningful 
reception for the video application. Unresponsive video sources, in fact, send data 
packets at full rate even if their packets are delayed by the underlying TCP-friendly 
congestion control procedures; so the increasing delay experienced by the video 
source will result in loss of synchronization at the receiving side, which needs to skip 
a potentially high number of packets to recover synchronization requirements. This 
occurrence results in a breakdown of the QoS perceived by the end user. 

In the last few years researchers have made great efforts to design TCP-friendly 
UDP-based congestion control procedures to be applied to real-time applications; 
however, very little effort has been devoted to investigating how real-time sources can 
be designed under TCP-friendly bandwidth constraints. In this paper we investigate 
the effects of the bandwidth variation rate during bandwidth profile variations on both 
the error of the rate controller in fitting the bandwidth profile, and the distortion 
introduced by the quantization mechanism of the MPEG video encoder. To this end 
we introduce an analytical framework modeling an MPEG video source where a 
feedback law is used to provide rate adaptation, as is usual in MPEG encoders [2-3]. 
The proposed paradigm addresses any feedback law which takes into account both the 
state of a counter, used to keep track of the previous encoding history, and the activity 
level of the next frame to be encoded; by so doing the model can be applied whatever 
the feedback law is, provided that the parameter to be varied is the quantizer scale. 

The paper is organized as follows. Section 2 describes the scenario being 
addressed. In Section 3, in order to derive the analytical model, we introduce a 
statistical analysis of MPEG traces aimed at characterizing both the activity process 
and the activity/emission relationships of the output flow of an MPEG encoder. Then, 
Section 4 defines the analytical framework modeling the adaptive-rate MPEG video 
source; to this end, first the model of the non-controlled MPEG source, defined as a 
switched batch Bernoulli process (SBBP) [4], is introduced, following an approach 
similar to the one proposed by the authors in [5-6]; then the adaptive-rate MPEG 
source is modeled as an SBBP/SBBP/1/K queueing system. Section 5 describes how 
the analytical framework can be used to evaluate the performance of an adaptive-rate 
MPEG source in terms of both the output-rate statistics, and the distortion introduced 
by the quantization mechanism. In Section 6 the above paradigm is used to investigate 
the effects of the bandwidth variation frequency on the above performance. Finally, 
the authors’ conclusions are drawn in Section 7. 



2 System Description 

The system we will refer to in this paper is shown in Fig. 1. It consists of an 
adaptive-rate MPEG video source over a UDP/IP protocol suite. The adaptive-rate 
MPEG video source can be, for example, a TM5 MPEG video source [3]. In order to 
be “friendly” with TCP sources present in the network, a TCP-friendly layer is put 
between the UDP layer and the encoder. The TCP-friendly layer measures network 
delays and losses and provides the bandwidth indication to the adaptive-rate source. 
Many TCP-friendly protocols have been defined in the literature (see for example [7-8]); 
they differ from each other in the method of measuring losses and delays, the 
measurement frequency, and the technique used to choose the bandwidth to be indicated to 
the source. 

Fig. 1. MPEG video encoder system 

Let us express the bandwidth indicated by the TCP-friendly layer to the adaptive- 
rate source in terms of packets/slot; given that this bandwidth changes in time due to 
network loss and delay variations, it is a stochastic process which we will indicate as 
N(n). Therefore N(n) indicates the number of packets the network accepts for 
transmission at the slot n. 

The adaptive-rate source is based on a video source whose output is MPEG 
encoded with a quantizer scale parameter (qsp) that varies according to the feedback 
provided by a rate controller, and then packetized according to the packet format 
imposed by the network. The rate controller works according to a feedback law with a 
given target. A possible target can be, for example, to encode the generic frame n with 
a number of packets equal to N(n). In order to achieve its target, the rate controller 
monitors the activity of the frame which is being encoded, its encoding mode, 
and the number of packets used to encode the previous frames. To obtain this last 
information, the rate controller uses a counter unit. 

The counter unit in the rate controller is incremented at each slot by the number of 
packets emitted by the MPEG encoder at the UDP layer output, $\tilde{Y}(n)$, and decreased 
by the number of packets indicated by the TCP-friendly layer, $N(n)$. Thus the 
value of the counter unit at each slot represents the credit/debit the encoding system 
has with respect to the desired target; it therefore assumes positive values, that is, it 
registers a debit, when the encoder has emitted too many packets with respect to the 
desired target, or negative values, that is, it registers a credit, when the encoder 
has emitted too few packets with respect to the desired target. Very large positive or 
negative counter values allow very large windows to store the encoder emission 
history, but they may cause high output-rate burstiness; we therefore bound counter 
values in a range $[-K_1, K_2]$: when the counter takes values over $K_2$ or below $-K_1$ 
packets, it is truncated to $K_2$ or $-K_1$, respectively. 



3 MPEG Statistical Properties 

An MPEG encoder generates a pseudo-periodic emission process depending on the 
activity, the frame encoding mode (I, P, or B), and the qsp used to encode the current 
frame. This emission process can be described by the activity process and the 
activity/emission relationships. The activity process only depends on the peculiarities 
of the scene being encoded; the activity/emission relationships represent the number 
of bits resulting from the encoding of a picture characterized by a given activity, and 
therefore depend on both the frame encoding mode and the qsp used to encode each 
frame. Of course, the qsp determines the distortion introduced by the encoder; for this 
reason, in this section we also address the emission/distortion relationships, that is, the 
relationships binding the number of bits emitted and the distortion introduced for each 
encoding mode and for each qsp. 

We analyzed the statistical characteristics of one hour of MPEG video sequences 
of the movie “The Silence of the Lambs” with the tool [9]. To encode this movie we 
used a frame rate of F = 24 frames/sec, and a frame size of M = 180 macroblocks. 
The GoP structure IBBPBB was used, selecting a ratio of total frames to intraframes 
of $G_I = 6$, and the distance between two successive P-frames or between the last 
P-frame in the GoP and the I-frame in the next GoP as $G_P = 3$. 

Referring to this video sequence, in the following subsections we will first study its 
activity process (Section 3.1), then the activity/emission relationships (Section 3.2), and 
finally the emission/distortion relationships (Section 3.3). 



3.1 Characterization of the Activity Process 

According to the MPEG standard, three elements are considered to encode a movie 
sequence: luminance, Y, chrominance Cb, and chrominance Cr. However, as 
luminance is the most relevant component characterizing the perceived quality, the 
activity of a video sequence is usually characterized using the luminance of each 
frame only [10]. The activity of the macroblock p in the frame n is defined as the 
variance of the luminance within this macroblock. So we can define the frame activity 
process, indicated here as the activity process, as the discrete-time process L(n) 
obtained by averaging the activities, $L_p(n)$, of the macroblocks within the frame n. In 
order to represent the activity process statistically, we measured its first- and 
second-order statistics in terms of the probability density function (pdf), $f_L(a)$, 
$\forall a \in [0, a_{MAX}]$, and normalized autocovariance function, $C_{LL}(m)$, respectively, where 
$a_{MAX}$ is the maximum value of the activity process. As regards the first-order statistics, 
in [11] it was demonstrated that the Gamma function, defined as: 

$$GAMMA_{m,l}(a) = \frac{a^{m-1} e^{-a/l}}{\Gamma(m)\, l^{m}} \qquad (1)$$

is a good approximation of the pdf, that is: 

$$f_L(a) = GAMMA_{m,l}(a) \qquad (2)$$

The terms m and l in (1) and (2) are the so-called “shape parameter” and “scale 
parameter”. If we indicate with $\mu_L$ and $\sigma_L^2$ the mean value and the variance of the 
activity process, m and l are defined as follows: 

$$m = \frac{\mu_L^2}{\sigma_L^2}, \qquad l = \frac{\sigma_L^2}{\mu_L} \qquad (3)$$

Therefore the mean value, $\mu_L$, and the variance, $\sigma_L^2$, of the activity process are 
sufficient to describe the first-order statistics of this process. For the movie considered 
here we measured $\mu_L = 89.63$ bits/frame and $\sigma_L^2 = 2.3041 \cdot 10^{4}$ (bits/frame)$^2$. 
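
The moment matching described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `gamma_params` and `gamma_pdf` are hypothetical, the shape and scale are obtained from the mean and variance as shape = mean²/variance and scale = variance/mean, and the numbers in the usage lines are illustrative rather than the values measured for the movie.

```python
import math

def gamma_params(mean, variance):
    """Return (shape, scale) of the Gamma pdf that matches a given mean and variance."""
    shape = mean ** 2 / variance
    scale = variance / mean
    return shape, scale

def gamma_pdf(a, shape, scale):
    """Gamma pdf a^(shape-1) * exp(-a/scale) / (Gamma(shape) * scale^shape) for a > 0."""
    if a <= 0:
        return 0.0
    # Evaluate in the log domain so that large shape values do not overflow.
    log_pdf = ((shape - 1) * math.log(a) - a / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)

# Illustrative numbers only (not taken from the paper).
shape, scale = gamma_params(mean=4.0, variance=8.0)
print(shape, scale)                   # 2.0 2.0
print(gamma_pdf(2.0, shape, scale))   # pdf value at a = 2
```

By construction the fitted distribution has mean shape·scale and variance shape·scale², so the two measured moments are reproduced exactly.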

As far as the second-order statistics are concerned, we have observed that the 
decreasing trend of the normalized autocovariance function curve can be 
approximated by a linear combination of exponential functions [4], that is: 

$$C_{LL}(m) \approx \sum_{w=1}^{W} \psi_w \lambda_w^{m}, \qquad \forall m \in [1, m_{MAX}] \qquad (4)$$

where $m_{MAX}$ is the width of the interval in which we want to fit the measured 
normalized autocovariance function and W is the number of exponential terms needed 
to minimize the approximation error. From a great number of measures we have 
observed that two exponential terms are sufficient to achieve an acceptable error, and 
the use of more exponential terms produces no tangible improvement. 

3.2 Characterization of the Activity/Emission Relationships 

The MPEG emission process is modulated by the activity process described in the 
previous section. Moreover, while the activity process does not depend on the 
encoding mode and the qsp value used, the emission process does. In this section we 
will characterize the dependence of the emission process on the activity process when 
a fixed qsp value q is used. As an example, we used q = 5. Let us indicate the overall 
emission process which results from the encoding with a fixed qsp q as $X_q(n)$, and 
let us decompose it into different emission processes, one for each frame in the 
GoP, $X_q^{(j)}(h)$, where $j \in J = [1, G_I]$ indicates the frame position in the GoP, and h a 
generic GoP. Of course, we have $X_q^{(j)}(h) = X_q(h \cdot G_I + j)$. For example, in the case of 
$G_I = 6$ and $G_P = 3$, $X_q^{(j)}(h)$ will refer to an I-frame if $j \in J_I = \{1\}$, a P-frame if 
$j \in J_P = \{4\}$, and a B-frame if $j \in J_B = \{2,3,5,6\}$. Given that frames of the same kind 
in the same GoP in any MPEG sequence have very close statistical parameters 
[5][12], in what follows, unless specified otherwise, we will only consider three 
emission processes, one for each kind of frame, $X_q^{(g)}(m)$, for each $g \in \{I,P,B\}$. 
Likewise, we will consider three different activity processes, one for each frame 
process, and we will indicate them as $L^{(g)}(m)$, for each $g \in \{I,P,B\}$. Moreover, in 
the following sections we will indicate the mean value of the I-, P- and B-frame 
emission processes as $\mu_I$, $\mu_P$ and $\mu_B$, and the variance of the same processes as 
$\sigma_I^2$, $\sigma_P^2$ and $\sigma_B^2$. 

Now we can define the activity/emission relationships when a fixed qsp value q is 
used as the distribution of the sizes of I-, P- and B-frames once the activity a of the 
same frame is given: 

$$y_q^{(g)}(r|a) = \lim_{m \to \infty} Prob\{X_q^{(g)}(m) = r \mid L^{(g)}(m) = a\}, \qquad \forall a \in [0, a_{MAX}], \ \forall g \in \{I,P,B\} \qquad (5)$$



As demonstrated in [5][12], these functions can also be well approximated by Gamma 
distributions: 

$$y_q^{(g)}(r|a) = GAMMA_{m_{(g|a),q},\, l_{(g|a),q}}(r), \qquad \forall g \in \{I,P,B\} \qquad (6)$$

where $m_{(g|a),q}$ and $l_{(g|a),q}$ can be derived as in (3) for each frame encoding mode g. 

So we can say that, for each activity a and for a given qsp value q, the 
activity/emission relationships can be exhaustively described by the mean value and 
variance of the I-, P- and B-frame emission processes, indicated here as $\mu_{(I|a),q}$ and 
$\sigma^2_{(I|a),q}$, $\mu_{(P|a),q}$ and $\sigma^2_{(P|a),q}$, and $\mu_{(B|a),q}$ and $\sigma^2_{(B|a),q}$, respectively. 

Fig. 2 shows them as functions of the activity values. In this figure we can observe 
that the mean values follow an approximately linear, increasing law, while the variances 
follow a parabolic law. So, $\forall a \in [0, a_{MAX}]$ and $\forall g \in \{I,P,B\}$, we can approximate 
these functions by means of the following functions: 

(7) 

Fig. 2: Mean and variance of the I-, P- and B-frame processes vs. the activity (solid line), 
compared with the best-fitting curves (dashed line). Panels: (a) mean value; (b) variance. 



Finally, we can state that the coefficients of these functions completely 
characterize the activity/emission relationships for a given qsp value q. 



3.3 Characterization of the Emission/Distortion Relationships 



The distortion introduced by quantization is one of the most important aspects of 
video encoding. When the qsp is varied during the encoding process to shape the 
encoder output rate, the quantization distortion is not constant: the higher the qsp 
values, the greater the quantization distortion. The distortion can be represented by 
the Peak Signal-to-Noise Ratio (PSNR) process, defined as follows: 

$$PSNR(n) = 10 \cdot \log_{10} \frac{(2^{d} - 1)^2}{MSE(n)} \qquad (8)$$

where d is the number of bits assigned to a pixel, and MSE(n) is the mean square 
error caused by quantization of the frame n. Fig. 3 shows the rate-distortion curves 
for each frame encoding mode $g \in \{I, P, B\}$, measured from the movie considered 
here. More specifically, Fig. 3a shows the rate curves, defined as the average value 
of the emission process versus the qsp value, q. Fig. 3b shows the distortion curves, 
$F^{(g)}(q)$, defined as the average value of the process PSNR(n) versus the qsp value, q. 
Finally, Fig. 3c links the rate and distortion curves shown in Figs. 3a and 3b, by 
plotting the average number of bits emitted for each frame $g \in \{I, P, B\}$ in order 
to obtain a given average value, $\varphi$, of the process PSNR(n). Any pair of these 
curves completely characterizes the distortion introduced by quantization. 
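
As a small illustration of the PSNR process, the following Python sketch computes the PSNR of one frame from its pixel values. It assumes the usual definition with the squared peak value, (2^d - 1)², and a base-10 logarithm; the helper names `mean_squared_error` and `psnr_db` are hypothetical and frames are represented as flat lists of luminance samples for simplicity.

```python
import math

def mean_squared_error(original, decoded):
    """Mean square error between two equally sized sequences of pixel values."""
    if len(original) != len(decoded):
        raise ValueError("frames must have the same number of pixels")
    return sum((o - d) ** 2 for o, d in zip(original, decoded)) / len(original)

def psnr_db(mse, bits_per_pixel=8):
    """PSNR in dB for a given MSE, assuming a peak value of 2^d - 1."""
    if mse == 0:
        return math.inf            # identical frames: no quantization error
    peak = 2 ** bits_per_pixel - 1
    return 10.0 * math.log10(peak ** 2 / mse)

frame = [16, 32, 64, 128]
decoded = [18, 30, 66, 126]        # every pixel is off by 2, so MSE = 4
print(psnr_db(mean_squared_error(frame, decoded)))
```

A larger qsp gives coarser quantization, hence a larger MSE and a lower PSNR, which is the monotone trend the distortion curves describe.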






Fig. 3. Rate-distortion curves: (a) rate curves; (b) distortion curves; (c) rate vs. distortion curves 



4 System Model 

The target of this section is to derive a discrete-time analytical model of an 
adaptive-rate MPEG video encoding system when its output is controlled with feedback from a 
TCP-friendly protocol. We will use $\Delta = 1/F$ as the slot duration, taken as being equal 
to the frame interval duration. 

As a first step, in Section 4.2 we model the non-controlled MPEG encoder output 
as a switched batch Bernoulli process (SBBP) [4]. Then, the adaptive-rate MPEG 
video source is modeled as an SBBP/SBBP/1/K queueing system in Section 4.3, where 
$K = K_1 + K_2 + 1$ is the range of the counter state. For the sake of completeness, 
Section 4.1 provides a brief outline of SBBP processes, which will constitute the 
model of both the input and the output of the queueing system. 



4.1 Switched Batch Bernoulli Process (SBBP) 

An SBBP Y(n) is a discrete-time emission process modulated by an underlying 
Markov chain. Each state of the Markov chain is characterized by an emission pdf: 
the SBBP emits data units according to the pdf of the current state of the underlying 
Markov chain. Therefore an SBBP Y(n) is fully described by the state space $\Im^{(Y)}$ of 
the underlying Markov chain, the maximum number of data units the SBBP can emit 
in one slot, $r^{(Y)}_{MAX}$, and the set $(Q^{(Y)}, B^{(Y)})$, where $Q^{(Y)}$ is the transition probability 
matrix of the underlying Markov chain, while $B^{(Y)}$ is the emission probability matrix 
whose rows contain the emission pdf’s for each state of the underlying Markov chain. 
If we indicate the state of the underlying Markov chain in the generic slot n as 
$S^{(Y)}(n)$, the generic elements of the matrices $Q^{(Y)}$ and $B^{(Y)}$ are defined as follows: 

$$Q^{(Y)}_{[s'_Y, s''_Y]} = \lim_{n \to \infty} Prob\{S^{(Y)}(n+1) = s''_Y \mid S^{(Y)}(n) = s'_Y\}, \qquad \forall s'_Y, s''_Y \in \Im^{(Y)} \qquad (9)$$

$$B^{(Y)}_{[s_Y, r]} = \lim_{n \to \infty} Prob\{Y(n) = r \mid S^{(Y)}(n) = s_Y\}, \qquad \forall s_Y \in \Im^{(Y)}, \ \forall r \in [0, r^{(Y)}_{MAX}] \qquad (10)$$

Below we will introduce an extension of the meaning of the SBBP to model not only 
a source emission process, but also a video sequence activity process, and a network 
bandwidth process. In the latter cases we will indicate them as an activity SBBP and a 
bandwidth SBBP, respectively, and their B matrices as the activity probability 
matrix and the bandwidth probability matrix, respectively. 
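
To make the definition concrete, the sketch below draws a sample path of an SBBP from a transition matrix and an emission matrix. It is a minimal illustration, not the authors' tool: the function name `simulate_sbbp`, the two-state matrices and the seed are all assumptions chosen for the example. Row s of `B` is the emission pdf of state s, and column r is the probability of emitting r data units.

```python
import random

def simulate_sbbp(Q, B, n_slots, start_state=0, seed=0):
    """Sample n_slots emissions of an SBBP with transition matrix Q and emission matrix B."""
    rng = random.Random(seed)
    state = start_state
    emissions = []
    for _ in range(n_slots):
        # Emit according to the pdf stored in the row of B for the current state.
        r = rng.choices(range(len(B[state])), weights=B[state])[0]
        emissions.append(r)
        # Then move the underlying Markov chain to its next state.
        state = rng.choices(range(len(Q)), weights=Q[state])[0]
    return emissions

# Hypothetical two-state source: a "low" state emitting 0 or 1 units
# and a "high" state emitting 2 or 3 units per slot.
Q = [[0.9, 0.1],
     [0.2, 0.8]]
B = [[0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5]]
trace = simulate_sbbp(Q, B, n_slots=20)
print(trace)
```

The same routine covers the activity SBBP and the bandwidth SBBP: only the interpretation of the columns of `B` changes (activity values or packets/slot instead of emitted data units).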



4.2 Non-controlled Source Model 

In this section we derive the SBBP process $\tilde{Y}_q(n)$, modeling the emission process of 
the non-controlled MPEG video source at the UDP layer output for a given qsp, q. 
The model has to capture two different components: the activity process behavior 
and the activity/emission relationships. More specifically, the transitions between the 
activity states, and between one frame and the successive one within a GoP, have to 
be modeled simultaneously. We will obtain the model of the non-controlled MPEG 
video source in three steps: 

1. derivation of an activity SBBP, G(n), modeling the activity process, L(n); 

2. derivation of the SBBP $Y_q(n)$, representing the whole measured MPEG emission 
process $X_q(n)$, first calculating its underlying Markov chain from the underlying 
Markov chain of the activity SBBP G(n), and then the emission process from the 
activity/emission relationships defined in (5); 

3. derivation of the SBBP $\tilde{Y}_q(n)$ at the packetizer output. 

The desired activity SBBP, G(n), has to fit the first- and second-order statistics, 
$f_L(a)$ and $C_{LL}(m)$, of the activity process L(n). Let us indicate the state space of the 
underlying Markov chain of the activity process G(n) as $\Im^{(G)}$. It represents the set of 
activity levels to be captured. For example, according to [13], we have 
$\Im^{(G)}$ = {Very low, Low, High, Very high}. The activity SBBP G(n) is defined by the 
parameter set $(Q^{(G)}, B^{(G)})$, where $Q^{(G)}$ is the transition matrix among the states in 
$\Im^{(G)}$, as is customary, whereas $B^{(G)}$ does not represent an “emission probability 
matrix”, but an “activity probability matrix”. The matrix $B^{(G)}$ is defined as follows: 
when the underlying Markov chain of G(n) is in the state representing the activity 
level i, the activity process takes values according to the probabilities in the i-th row of 
the activity probability matrix $B^{(G)}$. The activity SBBP G(n), defined by the 
parameter set $(Q^{(G)}, B^{(G)})$, can be obtained by solving the so-called inverse eigenvalue 
problem as in [5][14]. 

From the activity SBBP G(n) modeling the activity process, we can derive the 
SBBP $Y_q(n)$ modeling the non-controlled MPEG source when the qsp q is used. To 
this end let us define the state of the underlying Markov chain of $Y_q(n)$ as a double 
variable, $S^{(Y)}(n) = (S^{(G)}(n), S^{(F)}(n))$, where $S^{(G)}(n) \in \Im^{(G)}$ is the state of the 
underlying Markov chain of G(n), and $S^{(F)}(n) \in J$ is the frame position in the GoP 
at the slot n. Let us note that we have used $S^{(Y)}(n)$ instead of $S^{(Y_q)}(n)$ because the 
underlying Markov chain of $Y_q(n)$ is independent of q. For the same reason, below 
we will indicate its transition probability matrix as $Q^{(Y)}$ instead of $Q^{(Y_q)}$. 

To calculate the above matrix, let us note that the admissible transitions are only 
between two states such that the frame at the time slot n + 1, $S^{(F)}(n+1)$, is the one in 
the GoP following the frame $S^{(F)}(n)$. Thus the transition probability from the state 
$S^{(Y)}(n) = (i', j')$ to the state $S^{(Y)}(n+1) = (i'', j'')$, with $i', i'' \in \Im^{(G)}$, and $j', j'' \in J$, 
is: 

$$Q^{(Y)}_{[(i',j'),(i'',j'')]} = \begin{cases} Q^{(G)}_{[i',i'']} & \text{if } (j'' = j' + 1 \text{ and } j' < G_I) \text{ or } (j'' = 1 \text{ and } j' = G_I) \\ 0 & \text{otherwise} \end{cases} \qquad (11)$$
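
The construction of the underlying Markov chain over the pairs (activity level, GoP position) can be sketched as follows. The sketch rests on an assumption that is labeled here because the printed equation is hard to read: the activity chain moves at every slot according to its own transition matrix, while the GoP position advances deterministically by one and wraps from the last position back to 1. The names `build_source_chain` and `frame_type` are hypothetical.

```python
def frame_type(j, g_p=3):
    """Frame kind at GoP position j (1-based): I at 1, P every g_p positions, B otherwise."""
    if j == 1:
        return "I"
    return "P" if (j - 1) % g_p == 0 else "B"

def build_source_chain(Q_G, gop_length):
    """Return (states, Q_Y) for states (i, j): activity level i, GoP position j."""
    n_levels = len(Q_G)
    states = [(i, j) for i in range(n_levels) for j in range(1, gop_length + 1)]
    index = {s: k for k, s in enumerate(states)}
    Q_Y = [[0.0] * len(states) for _ in states]
    for (i1, j1) in states:
        j2 = j1 % gop_length + 1           # next frame position, wrapping to 1
        for i2 in range(n_levels):
            # Only transitions to the next GoP position are admissible.
            Q_Y[index[(i1, j1)]][index[(i2, j2)]] = Q_G[i1][i2]
    return states, Q_Y

# Hypothetical two-level activity chain with the IBBPBB GoP (length 6).
states, Q_Y = build_source_chain([[0.7, 0.3], [0.4, 0.6]], gop_length=6)
print(len(states))                              # 12
print([frame_type(j) for j in range(1, 7)])     # ['I', 'B', 'B', 'P', 'B', 'B']
```

Every row of the resulting matrix still sums to one, because each row simply copies one row of the activity transition matrix into the block of the next GoP position.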



As far as the emission probability matrix, $B^{(Y_q)}$, is concerned, its generic element 
depends on the frame encoding mode, the frame activity, and the qsp used. Once the 
qsp has been fixed, this matrix can be obtained from the activity/emission 
relationships. In fact, the probability of using r bits to encode the frame j when the 
state of the underlying Markov chain of the activity process is $i \in \Im^{(G)}$ and the qsp q 
is used, is: 

$$B^{(Y_q)}_{[(i,j),\,r]} = \begin{cases} \sum_{a=0}^{a_{MAX}} y_q^{(I)}(r|a)\, B^{(G)}_{[i,a]} & \text{if } j \in J_I \\ \sum_{a=0}^{a_{MAX}} y_q^{(P)}(r|a)\, B^{(G)}_{[i,a]} & \text{if } j \in J_P \\ \sum_{a=0}^{a_{MAX}} y_q^{(B)}(r|a)\, B^{(G)}_{[i,a]} & \text{if } j \in J_B \end{cases} \qquad (12)$$

where $y_q^{(g)}(r|a)$, for each $g \in \{I,P,B\}$, is the function introduced in (6) to 
characterize the activity/emission relationships, while $B^{(G)}_{[i,a]}$ is the probability that the 
activity is $a \in [0, a_{MAX}]$ when the activity level is $i \in \Im^{(G)}$. The set $(Q^{(Y)}, B^{(Y_q)})$ defines 
the SBBP emission process modeling the output flow of the non-controlled MPEG 
encoder, when this uses a fixed qsp value q. 
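
Each row of the emission matrix is thus a mixture: the conditional frame-size distributions, one per activity value, are averaged with the activity probabilities of the current activity level. A minimal Python sketch of one such row follows; the name `emission_row` and the tiny discrete distributions are hypothetical and only illustrate the weighting.

```python
def emission_row(activity_pmf, cond_pmf):
    """Mix conditional size distributions with the activity probabilities.

    activity_pmf[a] = Prob{activity = a | activity level i}
    cond_pmf[a][r]  = Prob{frame size = r | activity = a} for the frame type at hand
    """
    n_sizes = len(cond_pmf[0])
    return [sum(activity_pmf[a] * cond_pmf[a][r] for a in range(len(activity_pmf)))
            for r in range(n_sizes)]

# Hypothetical numbers: two activity values, three possible frame sizes.
activity_pmf = [0.25, 0.75]
cond_pmf = [[0.5, 0.5, 0.0],     # sizes when activity = 0
            [0.0, 0.5, 0.5]]     # sizes when activity = 1
print(emission_row(activity_pmf, cond_pmf))   # [0.125, 0.5, 0.375]
```

The position j in the GoP only selects which family of conditional distributions (I, P or B) is passed as `cond_pmf`.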



Finally, the last step is to obtain the MPEG source model at the UDP layer output. 
According to the UDP/IP protocol suite, let us denote the packet payload size 
available to the source to transmit information as U = 548 · 8 bits. The emission 
process $Y_q(n)$, expressed in bits emitted per slot, can easily be transformed into an 
emission process $\tilde{Y}_q(n)$, expressed in UDP packets emitted per slot. In fact, its 
transition probability matrix and emission probability matrix can be derived as 
follows: 

$$Q^{(\tilde{Y})}_{[s'_Y, s''_Y]} = Q^{(Y)}_{[s'_Y, s''_Y]}, \qquad \forall s'_Y, s''_Y \in \Im^{(Y)} \qquad (13)$$

$$B^{(\tilde{Y}_q)}_{[s_Y, \tilde{r}]} = \sum_{r = (\tilde{r} - 1) \cdot U + 1}^{\tilde{r} \cdot U} B^{(Y_q)}_{[s_Y, r]}, \qquad \forall s_Y \in \Im^{(Y)}, \ \forall \tilde{r} \in [0, \tilde{r}_{MAX}] \qquad (14)$$

In (14), $\tilde{r}_{MAX}$ is the maximum number of packets needed to transmit one frame and is 
given by $\tilde{r}_{MAX} = \lceil r_{MAX} / U \rceil$, where $\lceil x \rceil$ is the smallest integer not less than x. In the 
following, for the sake of simplicity, we will indicate the number of emitted packets 
as r instead of $\tilde{r}$. 
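
The change of unit from bits to packets amounts to summing the bit-level probabilities that fall into the same packet count. The following Python sketch shows this for a single row of the emission matrix; `packets_needed` and `packetize_pmf` are hypothetical names, and the toy payload of 2 bits in the usage lines is chosen only to keep the example readable (the text uses U = 548 · 8 bits).

```python
def packets_needed(bits, payload_bits):
    """Smallest number of packets that can carry the given number of bits."""
    return (bits + payload_bits - 1) // payload_bits   # integer ceiling

def packetize_pmf(bit_pmf, payload_bits):
    """Map bit_pmf[r] = Prob{r bits} to packet_pmf[k] = Prob{k packets}."""
    r_max = len(bit_pmf) - 1
    packet_pmf = [0.0] * (packets_needed(r_max, payload_bits) + 1)
    for r, p in enumerate(bit_pmf):
        # All sizes r with (k - 1) * U < r <= k * U end up in k packets.
        packet_pmf[packets_needed(r, payload_bits)] += p
    return packet_pmf

U = 548 * 8                                   # payload size in bits, as in the text
print(packets_needed(U, U), packets_needed(U + 1, U))    # 1 2
print(packetize_pmf([0.1, 0.2, 0.3, 0.4], payload_bits=2))
```

The transition matrix needs no change, since packetization alters what is emitted in a slot but not how the underlying chain evolves.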

4.3 Adaptive Source Model 

The adaptive source pursues a given output rate target by implementing a feedback 
law, $q = \phi(i, j, s_Q)$, in the rate controller, which calculates the qsp q to be used by the 
MPEG encoder for each frame. Here we will indicate the output emission process of 
the adaptive source, expressed in packets/slot, as $\tilde{Y}(n)$. 

To model the adaptive source we have to consider the system shown in Fig. 1 as a 
whole, indicated here as $\Sigma$, in which the counter is incremented by the source 
emission process $\tilde{Y}(n)$, and decremented by the network bandwidth process N(n). 
We model the counter unit with a discrete-time queueing system model with a 
dimension of $(K_1 + K_2 + 1)$, where $[-K_1, K_2]$ is the range of variation of the counter, 
as discussed in Section 2. Both the input and the output processes can be 
characterized as two SBBP processes, and the slot duration is the frame duration, 
$\Delta = 1/F$. 

The server capacity of this queueing system, that is, the number of packets which 
leave the queue at each time slot, is a stochastic process which coincides with the 
network bandwidth process N(n). Let us model this process with an SBBP process, 
called bandwidth SBBP in Section 4.1. Let it be characterized by the parameter set 
$(Q^{(N)}, B^{(N)})$, let $\Im^{(N)}$ be the state space of its underlying Markov chain, and $d^{(N)}_{MAX}$ be 
the maximum number of packets that can leave the buffer in one slot. 

The bandwidth SBBP N(n) can be equivalently characterized through the set of 
transition probability matrices, $C^{(N)}(d)$, including the probability that the server 
capacity is d packets. These matrices can be obtained from the parameter set 
$(Q^{(N)}, B^{(N)})$ as follows: 

$$C^{(N)}_{[s'_N, s''_N]}(d) = Q^{(N)}_{[s'_N, s''_N]} \cdot B^{(N)}_{[s''_N, d]}, \qquad \forall s'_N, s''_N \in \Im^{(N)}, \ \forall d \in [0, d^{(N)}_{MAX}] \qquad (15)$$



To model the queueing system, we assume a late arrival system with immediate 
access time diagram [4]: packets arrive in batches, and a batch of packets can enter 
the service facility if it is free, with the possibility of them being ejected almost 
instantaneously. Note that in this model a packet service time is counted as the 
number of slot boundaries from the point of entry to the service facility up to the 
packet departure point. Therefore, even though we allow the arriving packet to be 
ejected almost instantaneously, its service time is counted as 1, not 0. 

A complete description of $\Sigma$ at the n-th slot requires a three-dimensional state, 
$S^{(\Sigma)}(n) = (S^{(Q)}(n), S^{(N)}(n), S^{(Y)}(n))$, where: 

• $S^{(Q)}(n) \in [-K_1, K_2]$ is the virtual-buffer state in the n-th slot, i.e. the number of 
packets in the queue and in the service facility at the observation instant; 

• $S^{(N)}(n) \in \Im^{(N)}$ is the state of the underlying Markov chain of the bandwidth SBBP N(n); 

• $S^{(Y)}(n) \in \Im^{(Y)}$ is the state of the underlying Markov chain of $\tilde{Y}(n)$, which coincides with 
that of $\tilde{Y}_q(n)$, $\forall q$, given that, as said in Section 4.2, it is independent of the qsp 
used. 

According to the late arrival system with immediate access time diagram, let us note 
that, if we indicate the virtual-buffer state, the number of arrivals and the server 
capacity in the generic slot n as $s'_Q$, r and d, respectively, the virtual-buffer state in 
the generic slot (n + 1), $s''_Q$, can be obtained through the Lindley equation: 

$$s''_Q = \max\left(\min(s'_Q + r, K_2) - d, \ -K_1\right) \qquad (16)$$
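
The recursion can be traced slot by slot with a few lines of Python. This is a sketch under the assumption that the counter is first increased by the arrivals and capped at the upper bound, then decreased by the server capacity and floored at the lower bound; `next_counter` and `run_counter` are hypothetical names and the bounds in the example are arbitrary.

```python
def next_counter(s, arrivals, capacity, k1, k2):
    """One step of the bounded counter: add arrivals, cap at k2, serve, floor at -k1."""
    return max(min(s + arrivals, k2) - capacity, -k1)

def run_counter(arrivals, capacities, k1, k2, s0=0):
    """Trace of the counter over consecutive slots, starting from s0."""
    trace, s = [], s0
    for r, d in zip(arrivals, capacities):
        s = next_counter(s, r, d, k1, k2)
        trace.append(s)
    return trace

# Emitted packets per slot vs. packets granted by the TCP-friendly layer.
print(run_counter(arrivals=[5, 0, 0], capacities=[3, 3, 3], k1=2, k2=10))
# [2, -1, -2]
```

In the example the encoder first overshoots the target (debit of 2 packets), then stays silent for two slots, so the counter turns into a credit and saturates at the lower bound.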



Now, in order to derive the adaptive-source model, first let us characterize the 
queueing system input process, which is an SBBP whose emission probability matrix 
depends on the virtual-buffer state. To this end, we apply the algorithm introduced in 
Section 4.2 to obtain an SBBP model of the MPEG video source, $\tilde{Y}_q(n)$, for each qsp 
$q \in [1, 31]$. So we have a parameter set $(Q^{(Y)}, \{B^{(\tilde{Y}_q)}\}_{q=1..31})$ which represents an 
SBBP whose transition matrix is $Q^{(Y)}$, and whose emission process is characterized 
by a set of emission matrices, $\{B^{(\tilde{Y}_q)}\}_{q=1..31}$, consisting of one matrix for each qsp value 
q; at each time slot the emission of the source is therefore characterized by an 
emission probability matrix chosen according to the qsp value defined by the 
feedback law $q = \phi(i, j, s_Q)$. 



More concisely, as we did in (15) for the bandwidth SBBP, we characterize the 
emission process of the adaptive-rate MPEG video source through the set of matrices 
$\{C^{(\tilde{Y}, s_Q)}(r)\}$, $\forall r \in [0, \tilde{r}_{MAX}]$ and $\forall s_Q \in [-K_1, K_2]$, each matrix representing the 
transition probability matrix including the probability of r packets being emitted when 
the buffer state is $s_Q$ and the qsp is q. So the generic element of the matrix 
$C^{(\tilde{Y}, s'_Q)}(r)$ can be obtained from the above parameter set, as a function of the qsp 
value $q''$: 

$$C^{(\tilde{Y}, s'_Q)}_{[(i',j'),(i'',j'')]}(r) = Q^{(Y)}_{[(i',j'),(i'',j'')]} \cdot B^{(\tilde{Y}_{q''})}_{[(i'',j''),\,r]} \qquad (17)$$



where $q'' = \phi(i'', j'', s'_Q)$ is the qsp chosen when the frame to be encoded is the $j''$-th 
in the GoP, its activity level is $i''$, and the virtual-buffer state before encoding this 
frame is $s'_Q$. Finally, if we indicate two generic states of the system as $s'_\Sigma = (s'_Q, s'_N, s'_Y)$ 
and $s''_\Sigma = (s''_Q, s''_N, s''_Y)$, the generic element of the transition matrix of the MPEG 
encoder system as a whole, $Q^{(\Sigma)}$, can be calculated, thanks to (16), as follows: 

$$Q^{(\Sigma)}_{[(s'_Q, s'_N, s'_Y),(s''_Q, s''_N, s''_Y)]} = \sum_{d=0}^{d^{(N)}_{MAX}} \ \sum_{\substack{r = 0 \\ s''_Q = \max(\min(s'_Q + r, K_2) - d,\, -K_1)}}^{\tilde{r}_{MAX}} C^{(N)}_{[s'_N, s''_N]}(d) \cdot C^{(\tilde{Y}, s'_Q)}_{[s'_Y, s''_Y]}(r) \qquad (18)$$

Once the matrix $Q^{(\Sigma)}$ is known, we can calculate the steady-state probability array of 
the system $\Sigma$, $\pi^{(\Sigma)}$, as the solution of the steady-state system. 
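
For readers who want to reproduce the last step numerically, the sketch below computes the stationary vector of a transition matrix by repeated multiplication. It is one possible approach, not necessarily the one used by the authors: it assumes the chain is irreducible, and it iterates on the "lazy" matrix (I + Q)/2, which has the same stationary vector as Q but is aperiodic, so that the cyclic GoP structure does not prevent convergence. The name `steady_state` is hypothetical.

```python
def steady_state(Q, tol=1e-12, max_iter=100_000):
    """Stationary probability vector pi with pi = pi * Q and sum(pi) = 1."""
    n = len(Q)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        # One step of the lazy chain (I + Q) / 2: same fixed point as Q,
        # but aperiodic, so the iteration also settles for cyclic chains.
        step = [sum(pi[i] * Q[i][j] for i in range(n)) for j in range(n)]
        nxt = [0.5 * (p + s) for p, s in zip(pi, step)]
        delta = max(abs(a - b) for a, b in zip(nxt, pi))
        pi = nxt
        if delta < tol:
            break
    total = sum(pi)
    return [p / total for p in pi]

# Hypothetical two-state chain; its stationary vector is (2/3, 1/3).
print(steady_state([[0.9, 0.1], [0.2, 0.8]]))
```

For the full system the state space has (K_1 + K_2 + 1) times the number of bandwidth states times the number of source states entries, so a sparse representation of the matrix would be the natural next refinement.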



5 Performance Evaluation 

As said in Section 2, the adaptive source has the target of keeping the output traffic 
stream $\tilde{Y}(n)$ compliant with the bandwidth provided by the network. Therefore the 
main performance parameters are the output process statistics, which will be 
analytically derived in Section 5.1. Moreover, given the variability of the network 
bandwidth, other important parameters are the statistics of the distortion process, 
represented by the PSNR. These statistics will be obtained in Section 5.2. 



5.1 Output Process Statistics 



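In this section we calculate the pdf, $f_{\tilde{Y}}(r)$, $\forall r \in [0, \tilde{r}_{MAX}]$, of the encoding system 
output process. It can be obtained as follows: 

$$f_{\tilde{Y}}(r) = \lim_{n \to \infty} Prob\{\tilde{Y}(n) = r\} = \lim_{n \to \infty} \sum_{s_Q = -K_1}^{K_2} \ \sum_{s_N \in \Im^{(N)}} \ \sum_{(i,j) \in \Im^{(Y)}} Prob\{\tilde{Y}(n) = r, \ S^{(Q)}(n) = s_Q, \ S^{(N)}(n) = s_N, \ S^{(Y)}(n) = (i,j)\} \qquad (19)$$

Now, applying the theorem of total probability, and taking into account that the 
emission process in one slot does not depend either on the virtual-buffer state or on 
the bandwidth process in the same slot, we have: 

$$f_{\tilde{Y}}(r) = \sum_{s_Q = -K_1}^{K_2} \ \sum_{s_N \in \Im^{(N)}} \ \sum_{(i,j) \in \Im^{(Y)}} Prob\{\tilde{Y}(n) = r \mid S^{(Q)}(n) = s_Q, \ S^{(N)}(n) = s_N, \ S^{(Y)}(n) = (i,j)\} \cdot \pi^{(\Sigma)}_{[s_Q, s_N, (i,j)]} = \sum_{s_Q = -K_1}^{K_2} \ \sum_{(i,j) \in \Im^{(Y)}} B^{(\tilde{Y}_q)}_{[(i,j),\,r]} \sum_{s_N \in \Im^{(N)}} \pi^{(\Sigma)}_{[s_Q, s_N, (i,j)]} \qquad (20)$$

where $q = \phi(i, j, s_Q)$. Finally, from the pdf $f_{\tilde{Y}}(r)$, the mean value and variance of the 
output process can also be easily derived. 

5.2 Quantization Distortion 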

The target of this section is to evaluate both the static and time-variant statistics of the 
quantization distortion, represented here by the process PSNR(n) defined in ( 8 ). To 
this end, let us quantize the distortion curves with a set of L different levels of 
distortion, each representing an interval of distortion values where the 

quality perceived by users can be considered as constant. As an example, for the 
movie “The Silence of the Lambs”, from a subjective analysis obtained with 300 test 
subjects, the following L = 5 levels of distortion were envisaged: (p^ = [20.6,34.2] 

dB, (p^=]34.2,35.0] dB, = ]35.0,36.2] dB, (p^ = ]36.2,38.4] dB, and 

= ]38.4,52.l] dB. From the distortion curves introduced in Section 3.3, we 

can define the array 7 '^’ whose generic element, 7 ]'^ = {V(jr such that g (p^}, 

for each / g [l,i] , is the qsp range providing a distortion belonging to the /'* level for 
a frame g e {l,P,B}. Of course, by so doing we are assuming that a variation of q 
within the generic interval 7 ]^’ does not cause any appreciable distortion. 

From the distortion curves in Fig. 3b, we can calculate the following qsp ranges
corresponding to the above distortion levels $\varphi_l$, for each $l \in [1, 5]$:

• for the I-frames: $\Gamma^{(I)}$ = [[16, 31], [13, 15], [10, 12], [6, 9], [1, 5]];

• for the P-frames: $\Gamma^{(P)}$ = [[15, 31], [13, 14], [10, 12], [6, 9], [1, 5]];

• for the B-frames: $\Gamma^{(B)}$ = [[17, 31], [14, 16], [11, 13], [7, 10], [1, 6]].
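The qsp ranges listed above amount to a lookup from a frame type and a qsp value to a distortion level. A minimal sketch of that lookup follows; the dictionary name `QSP_RANGES` and the function name `distortion_level` are hypothetical, while the numeric ranges are the ones given in the list.

```python
# qsp ranges per frame type, copied from the list above.
# Position 0 in each list corresponds to distortion level 1.
QSP_RANGES = {
    "I": [(16, 31), (13, 15), (10, 12), (6, 9), (1, 5)],
    "P": [(15, 31), (13, 14), (10, 12), (6, 9), (1, 5)],
    "B": [(17, 31), (14, 16), (11, 13), (7, 10), (1, 6)],
}


def distortion_level(frame_type, q):
    """Return the 1-based distortion level whose qsp range contains q."""
    for level, (low, high) in enumerate(QSP_RANGES[frame_type], start=1):
        if low <= q <= high:
            return level
    raise ValueError(f"qsp {q} is outside the range 1..31")


level_i = distortion_level("I", 14)  # qsp 14 on an I-frame
level_b = distortion_level("B", 6)   # qsp 6 on a B-frame
```

Note that the same qsp can map to different levels for different frame types: a qsp of 15 falls in level 1 for a P-frame but in level 2 for an I-frame.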

As in the previous sections, let $q = \phi(i, j, s_Q)$ be the feedback law,
linking the virtual-buffer state, $s_Q$, the activity level, $i$, and the
position of the frame in the GoP to be encoded, $j$, to the qsp to be used in
order to fit the time-varying bandwidth. Moreover, for each $i$ and $j$, let us
consider the set of the values $s_Q$ such that
$q = \phi(i, j, s_Q) \in \Gamma_l^{(g)}$, that is, the range of values of the
virtual-buffer state for which the rate controller chooses qsp values belonging
to the level $\varphi_l$. It follows, by definition, that a variation of the
virtual-buffer state within this range does not cause any appreciable
distortion variation.

So, we can now calculate the probability that the value of the process PSNR(n)
is in the generic interval $\varphi_l$, and the pdf of the stochastic variable
representing the duration of the time the process PSNR(n) remains in the
generic interval $\varphi_l$ without interruption. They are defined as follows:

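[Equation (21): probability that PSNR(n) is in the generic interval
$\varphi_l$, expressed through per-frame terms; the expression is not legible
in the source.]

$$f_l(m) = \lim_{n \to \infty} \mathrm{Prob}\{ \mathrm{PSNR}(n+1) \in \varphi_l, \ldots,
\mathrm{PSNR}(n+m-1) \in \varphi_l,\ \mathrm{PSNR}(n+m) \notin \varphi_l \mid
\mathrm{PSNR}(n-1) \notin \varphi_l,\ \mathrm{PSNR}(n) \in \varphi_l \} \qquad (22)$$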



The term entering (21) is the probability that the value of the process PSNR(n)
is in the generic interval $\varphi_l$ for the $j$-th frame in the GoP. It can
be calculated from the system steady-state probability array as follows:

[Equation (23): expression of this probability in terms of the system
steady-state probability array; not legible in the source.]



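In order to calculate the pdf in (22), let us consider the matrix containing
the one-slot state transition probabilities towards system states in which the
distortion level is $\varphi_l$. It can be obtained from the transition
probability matrix of the system as follows:

[Equation (24): the generic element of this matrix equals the corresponding
element of the system transition probability matrix if the distortion level in
the destination state is $\varphi_l$, and 0 otherwise.]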



Therefore, the pdf in (22) can be calculated as the probability that the
system, starting from a distortion level $\varphi_l$, remains at the same level
for $m-1$ consecutive slots, and leaves this level at the $m$-th slot, that is
[15]:

[Equation (25): matrix expression of this probability, together with the
definition of the probability array used in it; not legible in the source.]



The first array in (25) is the steady-state probability array in the first slot
of a period in which the distortion level is $\varphi_l$. The second array,
instead, is the steady-state probability array in a generic slot in which the
distortion level is other than $\varphi_l$:

[Equation (26): definition of the second array; not legible in the source.]
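The duration computation described in words above (enter the level from outside, stay for m-1 slots, leave at the m-th slot) can be sketched numerically for a generic Markov chain. This is only an illustration under stated assumptions: the helper names `vec_mat` and `sojourn_pdf`, the toy two-state chain and its stationary array are hypothetical, and the sketch follows the verbal description rather than the paper's exact matrix expressions, which are not legible here.

```python
def vec_mat(v, M):
    """Row array times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def sojourn_pdf(P, pi, level, m_max):
    """pdf of the number of consecutive slots spent in the state set `level`.

    P is the transition matrix, pi its stationary array, `level` the set of
    state indices sharing the same distortion level.
    """
    n = len(P)
    # One-slot transitions towards states of the level (other columns zeroed).
    Q_in = [[P[i][j] if j in level else 0.0 for j in range(n)] for i in range(n)]
    # State distribution in the first slot of a visit: the level is entered
    # from a state outside it.
    outside = [0.0 if i in level else pi[i] for i in range(n)]
    start = vec_mat(outside, Q_in)
    total = sum(start)
    v = [x / total for x in start]
    pdf = []
    for _ in range(m_max):
        # Probability of leaving the level at the next slot.
        leave = sum(v[i] * P[i][j]
                    for i in range(n) for j in range(n) if j not in level)
        pdf.append(leave)
        v = vec_mat(v, Q_in)  # stay one more slot inside the level
    return pdf


# Toy chain (illustrative): state 0 is "the level", state 1 is everything else.
P = [[0.8, 0.2],
     [0.4, 0.6]]
pi = [2 / 3, 1 / 3]  # stationary array of P
pdf = sojourn_pdf(P, pi, level={0}, m_max=200)
mean_duration = sum(m * p for m, p in enumerate(pdf, start=1))
```

In this toy case the duration is geometric with leaving probability 0.2, so the pdf starts 0.2, 0.16, 0.128 and the mean duration is 5 slots, up to the truncation at `m_max`.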



At this point we have calculated the pdf of the duration of a stay in the level
$\varphi_l$. Its mean value can now be obtained as follows:

[Equation (27): mean value of the duration, i.e. the sum over $m$ of $m$ times
the pdf in (22), in closed matrix form; the expression is not legible in the
source.]

where I is the identity matrix.

6 Numerical Results

Let us now apply the analytical framework proposed in the previous sections to
numerically evaluate the performance of the MPEG adaptive source when the output
rate is controlled by the TCP-friendly layer. To this end we address the encoding
of the movie “The Silence of the Lambs”, which was studied in Section 3, and we
use the same feedback law used by the TM-5 standard, that is, with the target of
obtaining a constant number of packets per GoP, leaving the number of packets for
encoding each frame in the GoP free in order to pursue a constant distortion
level within the GoP. At the first slot of the generic GoP $h$, for each $h$,
the TCP-friendly algorithm calculates the number of packets to be used in that
GoP, and holds the bandwidth process value $N(n)$ constant for all the frames of
that GoP, at that number of packets divided by the number of frames in the GoP.

The qsp is chosen as follows:

[Feedback-law equations: the qsp chosen for the $j$-th frame is the minimum $q$
such that the expected number of packets for that frame, plus the expected
number of packets for the remaining frames $k \in [j+1, G]$ of the GoP encoded
with the same PSNR, does not exceed the number of packets still available; the
expressions are not legible in the source.]

where:

• $PSNR_j(q)$ is the PSNR of the $j$-th frame in the GoP if it is encoded using
the qsp $q$;

• $R_{i,j}(q)$ is the expected number of packets used to encode the $j$-th frame
in the GoP, if the qsp used is $q$ and the activity state is $i$. [Its
expression is not legible in the source.]

• $Q_j$ is the number of packets still available to encode the $j$-th and the
remaining $G - j$ frames of the GoP. These packets will be distributed to
maintain the same distortion level in the GoP. Given that, by this law, the
counter at each slot $n$ is decreased by $N(n)$, and incremented by the number
of packets actually used by the encoder in the same interval, if we indicate
the mean value of the bandwidth process $N(n)$ as $E\{N(n)\}$ [its expression,
Equation (30), is not legible in the source], we have:

$$Q_j = G \cdot E\{N(n)\} - s_Q - (j - 1) \cdot E\{N(n)\} \qquad (31)$$

where $G$ is the number of frames in a GoP.

Fig. 4. Network bandwidth process pdfs. Panels: (b) Bandwidth SBBP pdf, state
1; (c) Bandwidth SBBP pdf, state 2.



Thanks to the counter, the above feedback law considers the packet debit/credit 
resulting from both the previous GoP and the previous frames in the same GoP. 

For the TCP-friendly protocol we used the TFRC protocol [7]. It is an equation- 
based protocol which calculates the sending rate, expressed in bytes/sec, through the 
TCP response function. Implementing the TFRC protocol in the TCP-friendly layer 
and applying it to a “greedy” source (i.e. a source which always has something to 
transmit) in the intranet of the Catania University Campus, the resulting output 
process pdf is shown in Fig. 4a. In order to apply the framework proposed in the
paper, we need an SBBP model of the output bandwidth. Since deriving such a
model is beyond the scope of this paper, we assume a two-state model, with the
same steady-state probability for both states, for the bandwidth process at the
beginning of each GoP.
By varying the mean state durations, this will allow us to analyze how the bandwidth 
variation frequency influences adaptive-rate source performance. The SBBP model 
we use for the bandwidth process at the beginning of each GoP is the following: 



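$$Q^{(GoP)} = \begin{bmatrix} 1 - 1/T & 1/T \\ 1/T & 1 - 1/T \end{bmatrix} \qquad (32)$$

$$B^{(GoP)} = \begin{bmatrix}
0.067 & 0.195 & 0.55 & 0.1 & 0.088 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0.03 & 0.07 & 0.123 & 0.154 & 0.18 & 0.37 & 0.073
\end{bmatrix} \qquad (33)$$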



where the value of $1/T$ in (32) is the bandwidth changing frequency. We will
analyze 8 cases, each featuring a different frequency, $T$ = [1, 2, 3, 4, 5, 6,
7, 8] GoPs. The rows of $B^{(GoP)}$ are the pdfs of each state of the bandwidth
SBBP at the beginning of each slot. They are shown in Figs. 4b and 4c. Of
course, their combination gives the bandwidth pdf shown in Fig. 4a. The mean
values of the pdfs of the two states are 1.95 and 7.79 packets/slot, while the
overall mean value of the available network bandwidth is 4.87 packets/slot.
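The mean values quoted above can be checked directly from the two state pdfs. The sketch below rests on two assumptions that are not stated explicitly in the text: the eleven columns are read as bandwidth values 0 to 10 packets/slot, and the fifth entry of the first row is read as 0.088, the value that makes that row sum to 1. The names `B_GOP` and `mean_of_pdf` are hypothetical.

```python
# Rows: pdf of the bandwidth (packets/slot) in each state of the two-state model.
# Column k is assumed to correspond to a bandwidth of k packets/slot.
B_GOP = [
    [0.067, 0.195, 0.55, 0.1, 0.088, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0.03, 0.07, 0.123, 0.154, 0.18, 0.37, 0.073],
]


def mean_of_pdf(pdf):
    """Mean of a discrete pdf defined on the values 0, 1, 2, ..."""
    return sum(value * prob for value, prob in enumerate(pdf))


state_means = [mean_of_pdf(row) for row in B_GOP]
# Both states have the same steady-state probability (0.5 each).
overall_mean = 0.5 * state_means[0] + 0.5 * state_means[1]
```

Under these assumptions the state means come out close to 1.95 and 7.79 packets/slot and the overall mean close to 4.87 packets/slot, in agreement with the values in the text.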

Finally, the parameter set of the bandwidth SBBP can be found as follows:

[Equation (34): matrices of the bandwidth SBBP, built from $Q^{(GoP)}$, from
the rows $B^{(GoP)}_{[1,:]}$ and $B^{(GoP)}_{[2,:]}$ of $B^{(GoP)}$, and from
the two terms defined below; the expression is not legible in the source.]

where:

• $Q^{(CIRC,G)}$ is the transition probability matrix for a circulant Markov
chain [14], and has been introduced to take into account the fact that the
bandwidth process remains constant for the whole duration of the GoP; its
generic element is defined as follows:

$$Q^{(CIRC,G)}_{[j',j'']} = \begin{cases} 1 & \text{if } (j'' = j' + 1) \text{ or } (j' = G \text{ and } j'' = 1) \\ 0 & \text{otherwise} \end{cases} \qquad (35)$$

• $1_G$ is the column array whose elements are all equal to 1.
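The circulant matrix described in (35) simply advances the frame position within the GoP by one at each slot and wraps from the last position back to the first. A minimal sketch of its construction follows; the function name `circulant_matrix` is hypothetical, and positions are 0-based in the code while the text counts them from 1.

```python
def circulant_matrix(G):
    """G x G matrix with a 1 from each position to the next, wrapping at the end."""
    Q = [[0] * G for _ in range(G)]
    for j in range(G):
        Q[j][(j + 1) % G] = 1  # position j moves to j + 1; the last wraps to 0
    return Q


Q_circ = circulant_matrix(3)
# Q_circ == [[0, 1, 0],
#            [0, 0, 1],
#            [1, 0, 0]]
```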

The choice of the matrices $Q^{(GoP)}$ and $B^{(GoP)}$ implies that the mean
network bandwidth alternates between the values 1.95 and 7.79 packets/slot, and
the time interval with a constant mean network bandwidth has a mean duration of
$T$ GoPs.

Fig. 5 shows the mean value of both the output rate and the PSNR. We can
observe that neither the mean value of the output process nor the mean PSNR is
influenced by bandwidth variations. More specifically, the mean value of the
output process is equal to the mean value of the available bandwidth, i.e. 4.87
packets/slot. This means that the feedback law we are using is able to follow
bandwidth variations, thanks to the memory given by the counter.

Instead, the statistics which are influenced by a change in bandwidth variation
frequency regard the PSNR duration. In order to analyze this, Fig. 6 shows the
mean duration of each PSNR level $\varphi_l$, for each $l \in [1, 5]$,
introduced in Section 5.2, and of any PSNR level. From this figure we can
observe that, as expected, the mean duration of all the levels increases when
the network bandwidth becomes more stable. Nevertheless, the best and the worst
levels, $\varphi_5$ and $\varphi_1$ respectively, present the highest mean
duration values, while the intermediate levels are transient levels and are
less influenced by bandwidth variation frequency changes. Finally, from Fig. 6b
we can obtain the maximum frequency beyond which the image quality becomes
unstable, due to frequent changes in the PSNR level.



7 Conclusions 

The co-existence of UDP-based applications with TCP-based applications
constitutes a key issue for the Internet of the near future. The problem is to
enhance UDP-based multimedia applications with some kind of congestion control,
in order to make them behave like "good network citizens" at times of bandwidth
scarcity. In this paper we have investigated the effects of the bandwidth
variation rate on both the error of the rate controller in fitting the bandwidth
profile, and the distortion introduced by the quantization mechanism of the MPEG
video encoder.

Fig. 5. Mean values of the performance metrics: (a) Mean output rate; (b) Mean
PSNR.

Fig. 6. Mean PSNR duration: (a) For each level; (b) Global.

To this end we have introduced an SBBP/SBBP/1/K queueing system modeling an
MPEG video source where a feedback law is used to provide rate adaptation. The
proposed paradigm addresses any feedback law which takes into account both the
state of a counter, used to keep track of the previous encoding history, and the
activity level of the next frame to be encoded; by so doing the model can be
applied whatever the feedback law is, provided that the parameter to be varied
is the quantizer scale. Finally we have applied the proposed analytical
framework to a real case, i.e. transmission of the movie “The Silence of the
Lambs” on a TCP-friendly UDP connection. We have demonstrated that, thanks to
the memory provided by the counter unit to the rate-controlled source, the
output flow presents the same mean output rate as the network bandwidth.
However, when the network bandwidth variation frequency is high, the mean
duration of each PSNR level is so short that perceived quality is unacceptable
due to instability.

Abstract. The design of efficient unicast Internet video playback applications
requires proper integration of encoding techniques with transport mechanisms.
Because of the mutual dependency between the encoding technique and the
transport mechanism, design of such applications has proven to be a challenging
problem. This paper presents an architecture which allows the joint design of a
transport-aware video encoder with an encoding-aware transport. We argue that
layered encoding provides maximum flexibility for efficient transport of video
streams over the Internet. We describe how off-line layered encoding techniques
can achieve robustness against imprecise knowledge about channel behavior
(i.e., bandwidth and loss rate) while maximizing efficiency for a given
transport mechanism. Then, we present our prototyped client-server
architecture, and describe key components of the transport mechanism and their
design issues. Finally, we describe how encoding-specific information is
utilized by transport mechanisms for efficient delivery of stored layered video
despite variations in channel behavior.



1 Introduction 

The design of efficient unicast Internet video playback applications requires 
proper integration of encoding techniques with transport mechanisms. Because 
of the mutual dependency between the encoding technique and the transport 
mechanism, design of such applications has proven to be a challenging problem. 

Encoding techniques typically assume specific channel behavior (i.e., loss
rate and bandwidth) and then the encoder is designed to maximize compression
efficiency for the expected channel bandwidth while including a sufficient
amount of redundancy to cope with the expected loss rate. If channel behavior
diverges from expected behavior, the quality of the delivered stream would be
lower than expected. The shared nature of Internet resources implies that the
behavior of Internet connections could vary substantially with unpredictable
changes in co-existing traffic during the course of a session. This requires
all Internet transport mechanisms to incorporate some type of congestion
control mechanism (e.g., [?], [?]). Thus, to pipeline a pre-encoded stream
through a congestion-controlled connection, video playback applications should
be able to operate efficiently over the expected range of connection behaviors.
This means that video playback applications should be both quality adaptive, to
cope with long-term variations in bandwidth, and loss resilient, to be robust
against the range of potential loss rates.

In this paper, we argue that layered encoding is the most promising approach
to Internet video playback. Layered encoders structure video data into layers
based on importance, such that lower layers are more important than higher
layers. This structured representation allows transport mechanisms to
accommodate variations in both available bandwidth and anticipated loss rate,
thus enabling simpler joint designs of encoder and transport mechanisms. First,
transport mechanisms can easily match the compressed rate with the average
available bandwidth by adjusting delivered quality. Second, transport
mechanisms can repair missing pieces of different layers in a prioritized
fashion over different time scales. Over short timescales (e.g., a few round
trip times (RTT)), lost packets from one layer can be recovered before any lost
packet of higher layers. This allows transport mechanisms to control the
observed loss rate for each layer. This is crucial because neither the total
channel loss rate nor its distribution across layers is known a priori, and
both can be expected to change during transmission. Over long timescales (e.g.,
minutes), the server can patch the quality of the delivered stream by
transmitting pieces of a layer that were not delivered during the initial
transmission. Over even longer timescales, the server can send extra layers, to
improve the quality of a previously-transmitted stream without being
constrained by the available bandwidth between client and server. This allows
adjustment of quality for a cached stream at a proxy [?].

We have prototyped a client-server architecture (Figure 1) for playback of
layered video over the Internet. The congestion control (CC) module determines
the available bandwidth (BW) based on the network feedback (e.g., the client’s
acknowledgments). The loss recovery (LR) module utilizes a portion of the
available bandwidth (BW_lr) to repair some recent packet losses such that the
observed loss rate remains below the tolerable rate for the given encoding
scheme. The server only observes the remaining losses that have not been
repaired, and uses the remaining portion of bandwidth (BW_qa) to perform
quality adaptation (QA) by adjusting the delivered quality (i.e., the number of
transmitted layers). The QA and LR mechanisms each depend on parameters of the
specific encoding and are tightly coupled. The collective performance of the QA
and LR mechanisms determines the perceived quality of the video playback. A key
component of the architecture is a Bandwidth Allocator (BA) that divides the
total available bandwidth between the QA and LR modules using information that
depends on the specific encoding and on client status.
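To make the role of the Bandwidth Allocator concrete, here is a deliberately simple sketch of one possible split between loss recovery and quality adaptation. The policy is an assumption made for illustration, not the allocator of the prototype: it reserves just enough bandwidth to retransmit the fraction of data by which the observed loss rate exceeds the tolerable one, and ignores losses among the retransmissions themselves. The function name `allocate_bandwidth` and its parameters are hypothetical.

```python
def allocate_bandwidth(bw, loss_rate, tolerable_loss):
    """Split the available bandwidth bw into (bw_lr, bw_qa).

    Hypothetical policy: repair only the losses in excess of the tolerable
    rate, and give everything else to quality adaptation.
    """
    excess_loss = max(0.0, loss_rate - tolerable_loss)
    bw_lr = bw * excess_loss  # share used for retransmissions
    bw_qa = bw - bw_lr        # share left for sending layers
    return bw_lr, bw_qa


# Example: 1000 units of bandwidth, 5% observed loss, 1% tolerable loss.
bw_lr, bw_qa = allocate_bandwidth(1000.0, 0.05, 0.01)
```

With these illustrative numbers about 4% of the bandwidth goes to loss recovery and the rest to quality adaptation; when the observed loss is already below the tolerable rate, all of the bandwidth is left to quality adaptation.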

This paper describes our ongoing work to integrate transport-aware encoding
with encoding-aware transport for Internet video playback. We consider a
coupled design, in which the encoder and transport are each designed given
knowledge of the expected behavior of the other. For transport-aware encoding,
we present the main design choices and trade-offs for layered encoding, and
describe how the encoding schemes can be customized based on available
knowledge regarding the employed transport mechanism and regarding the expected
channel rate and loss rate. For the design of encoding-aware transport
mechanisms, we focus on design strategies for the BA, QA and LR modules that
are directly affected by details of the deployed layered encoding scheme.

Fig. 1. Client-server Architecture

The rest of this paper is organized as follows. In Section 2 we review some of
the related work. Because of the natural ordering between encoding and
transport, we consider transport-aware encoding using a layered video encoder
in Section 3. Then in Section 4 we address various components of an
encoding-aware transport mechanism and examine their design trade-offs. This
includes bandwidth allocation (Section 4.1), loss recovery (Section 4.2) and
quality adaptation (Section 4.3). Section 5 concludes the paper and addresses
some of our future plans.



2 Related Work 

Traditionally, video encoders have been designed for transport over fixed-rate 
channels (with fixed and known channel bandwidth) with few if any losses. This 
creates a bit stream that will have poor quality if transported over a network with 
different bandwidth or loss rate. Either a higher loss rate or a lower bandwidth 
would produce potentially significant visual artifacts that propagate with time. 
For transport over networks, the encoder design should change to be cognizant 
of the fact that the bit stream may have to deal with varying channel band- 
width and non-zero loss rate. Over the years, several classes of solutions have 
been proposed for encoding and transmitting video streams over the network as 
follows: 

— One-layer Encoding: In one-layer video encodings (e.g., MPEG-1 and MPEG-2
Main Profile), the trade-off between compression and resilience to errors
is achieved by judiciously including Intra-blocks, which do not use temporal
prediction. The choice of which blocks to code as I-blocks for a given
sequence can be optimized if the channel loss rate is known a priori [?].
Transport mechanisms cannot gracefully adjust the quality of a one-layer
pre-encoded stream to the available channel bandwidth. A common solution is
to deploy an encoding-specific packet dropping algorithm [?]. In these
algorithms, the server discards packets that contain lower-priority
information (e.g., drops B frames of MPEG streams) to match the transmission
rate with the channel bandwidth. The range of channel rates over which these
algorithms are useful is usually limited, and the delivered quality may be
noticeably degraded. Both these effects are content and encoding specific.

— Multiple description coders: Another approach is to use a multiple description
(MD) video encoder [?]. MD coders are typically more robust to uncertainties
in the channel loss rate at the time of encoding. However, they still require
similar network support to adapt to varying channel bandwidth.

— Multiple Encodings: One alternative to cope with variations in channel
behavior is to maintain a few versions of each stream, each encoded for
different network conditions. Then the server can switch between different
encodings in response to changes in network conditions. Limitations of this
approach are the inability to improve the quality of an already-transmitted
portion of the stream, and the inability of such a system to respond quickly
to sharp decreases in available bandwidth.

— Layered Encodings: Hierarchical encoding organizes compressed media in a
layered fashion based on its importance, i.e., layer i is more important than
all higher layers and less important than all lower layers. If the lower, more
important layers are received, a base quality video can be displayed, and if
the higher, less important layers are also received, better quality video can
be displayed. The layered structure of the encoded stream allows the server to
cope effectively with uncertain channel behavior.

In summary, layered encoding has three advantages: First, layered video al- 
lows easy and effective rate-matching. The server can match the bandwidth of 
delivered stream with the available network bandwidth by changing the number 
of layers that are transmitted. This relaxes the need for knowing exact chan- 
nel bandwidth at the time of encoding, thus it helps to decouple transport and 
encoding design. 
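The rate-matching step described in this paragraph can be sketched as picking the largest number of layers, lowest first, whose cumulative rate fits in the available bandwidth. The function name `layers_to_send` and the layer rates in the example are hypothetical and only illustrate the idea.

```python
def layers_to_send(layer_rates, available_bw):
    """Number of layers (lowest first) whose cumulative rate fits available_bw."""
    total = 0.0
    count = 0
    for rate in layer_rates:
        if total + rate > available_bw:
            break  # a higher layer is useless without the ones below it
        total += rate
        count += 1
    return count


# Illustrative rates in Kbps: one base layer and two enhancement layers.
rates = [64, 128, 128]
n_layers = layers_to_send(rates, available_bw=200)
```

With 200 Kbps available, the base layer and the first enhancement layer fit (192 Kbps in total), so two layers are sent; the encoder did not need to know this bandwidth in advance.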

Second, layered video allows unequal error protection to be applied during 
transport, with stronger error correction applied to the more important layers. 
This can be used effectively even if the expected loss rate is not known at the time 
of compression. Thus, even though one-layer video and the most important layer 
of layered video are equally susceptible to losses, discrepancies between the actual 
loss rate and the one assumed at the time of encoding can be accommodated by 
the transport mechanism. 

Third, layered encoding allows the server to improve the quality of an
already-transmitted portion of the stream. We call this quality patching. In
essence, the layered structure allows the server to deliver different portions
of various layers in any arbitrary order, i.e., to reshape the stream for
delivery through an uncertain channel. Thus the quality of the delivered stream
can be smoothed out over various timescales. Figure 2 illustrates the
flexibility of layered encoding to reshape the stream for transmission through
the network. The server adjusts the number of layers when there are long-term
changes in available bandwidth. The total loss rate is randomly distributed
across the active layers. However, the server can prioritize loss recovery by
retransmitting losses of layer i before layer j for any i < j. When extra
bandwidth becomes available, the server can either add the fourth layer or
alternatively transmit the five missing segments of layer 2 (between $t_1$ and
$t_2$). This shows how a layered encoded stream can be reshaped for delivery
through the network.
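The prioritized loss recovery rule stated above (losses of layer i are retransmitted before those of layer j for any i < j) can be sketched as a sort of the pending losses. The representation of a lost packet as a (layer, sequence number) pair and the tie-break by sequence number within a layer are assumptions made for this example; the function name `retransmission_order` is hypothetical.

```python
def retransmission_order(lost_packets):
    """Order lost packets so that lower layers are repaired first.

    Each packet is a (layer, sequence_number) tuple. Within a layer the
    earliest sequence number goes first (an assumption of this sketch).
    """
    return sorted(lost_packets, key=lambda packet: (packet[0], packet[1]))


lost = [(2, 7), (0, 9), (1, 3), (0, 4)]
order = retransmission_order(lost)
# order == [(0, 4), (0, 9), (1, 3), (2, 7)]
```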




Fig. 2. Flexibility of Layered Encoding 



3 Transport- Aware Layered Encoding 

In this section, we consider how best to create a video bit stream to be stored
at the media server in Figure 1. We argued earlier that a layered (or in this
paper, equivalently a scalable) encoder will produce a better bit stream than 
a one-layer encoder if the values of channel bandwidth and loss rate are not 
known at the time of encoding. Despite these advantages inherent in layered 
video, good system performance requires careful design. We begin with some 
background on layered video encoders, and then show how incorporating at the 
time of encoding as much information as possible about the channel bandwidth 
and loss rate can generate the best bit stream for storage. We also discuss the 
information that needs to be shared between encoder and transport to improve 
quality of delivered video. 

3.1 Layered Coding Background 

There are two basic methods to create a layered video bit stream: 3-dimensional
wavelets [?] and adding layering to a traditional DCT-based video coder with
inter-frame temporal prediction [?]. The former has the drawbacks of poor
compression performance in sequences with complex motion, and poor video
quality when temporal sub-bands are lost, due to motion blurring. For these
reasons, we focus on the family of layered DCT-based temporally-predictive
video coders in this paper.

Among this family of coders, there has been much research and several stan- 
dardized coders. In these coders, P-frames are predicted from previous frames 
and the DCT coefficients are coded in a layered fashion. Several important as- 
pects of these coders are the following: 

— how the DCT coefficients are partitioned between each of the layers (i.e., is
partitioning done in the frequency domain, the spatial domain, or the quality
or SNR domain),

— whether enhancement-layer information uses temporal prediction at all, and 
if so whether lower-layer base information is used to predict the higher-layer 
enhancement information, 

— how much bandwidth is allocated for each layer, 

— how much of that bandwidth is redundancy in each layer, 

— and how many layers are created. 

These decisions all affect the compression efficiency of the layered coder, and 
also affect the robustness against packet loss. Thus, knowledge of the expected 
bandwidth and loss rates at the time of transport can have a large impact on 
the best choice of codec design. 

While most standardized layered encoders produce two or at most a few layers
[?], recent interest in having many layers has led to a standard for MPEG-4
Fine Granularity Scalability (FGS) [?]. The bit stream structure of MPEG-4
FGS makes it highly robust and flexible to changes in available bandwidth.
However, such robustness comes with a significant penalty to the efficiency of
the compression and hence to the video quality as the rate increases. More
recent, not yet standardized approaches to layered encoding markedly improve
the compression efficiency without too much sacrifice of robustness [?].

The design of a layered encoder is based on underlying assumptions regarding 
the nature of the transport and channel. Specifically, a layered coder assumes 
that the transport mechanism will initially attempt to send the more important 
information in the available bandwidth, and that the loss recovery will be ap- 
plied to the more important parts before the less important parts. This implicitly 
requires buffering at the client and in the transport. However, the exact band- 
width and loss rate experienced at the time of transport is typically not known 
at the time of encoding, and further assumptions must be made. In general, 
these assumptions have been implicit in the design of the layered video encoder. 
Here we make them explicit. We consider first the bandwidth, then consider the 
loss rate. 

3.2 Incorporating Bandwidth Knowledge 

The more knowledge available at the time of encoding regarding the expected 
range of operating bandwidths, the better. We focus on how knowledge of the 



Design Issues for Layered Quality-Adaptive Internet Video Playback 439 



available bandwidth impacts the prediction strategy of the encoder. If the
available bandwidth is known to vary between $R_{min}$ and $R_{max}$, three
different prediction strategies are useful depending on the expected behavior
of the bandwidth within this range.

First, consider the case where the bandwidth is nearly always close to
$R_{max}$. Then the best design would be to rely heavily on prediction for the
enhancement layers so as to improve compression efficiency. (This is
essentially the strategy of a one-layer encoder.) Such a design will suffer
small degradation when bandwidth dips slightly below $R_{max}$, but in such an
environment the probability of larger degradations is small.

Second, consider the case where the bandwidth is usually near $R_{min}$,
although we would like better quality when the rate is higher. Then the best
design would be to use temporal prediction only for the base layer with rate
$R_{min}$, and to use no temporal prediction for the higher layers. (This is
the strategy used by MPEG-4 FGS.) Such a design is very robust to variations in
bandwidth in $[R_{min}, R_{max}]$, but is also not very efficient at rates near
$R_{max}$.

Third, suppose we have little knowledge of the bandwidth other than that it
lies in the range $[R_{min}, R_{max}]$. In this case, the best algorithm is to
judiciously choose the prediction strategy for each macroblock in the video, so
as to balance both compression efficiency (at rates near $R_{max}$) and
robustness (at rates near $R_{min}$). Figure 3 illustrates these concepts for
the sequence Hall monitor using the scalable video coder in [?]. This coder has
the flexibility of using three different methods of prediction for the
different layers, which allows it to mimic both a one-layer video coder and the
MPEG-4 FGS video coder through an appropriate restriction of the prediction
methods. The results in Figure 3 are obtained by creating a single bit stream
for each illustrated curve, and successively discarding enhancement-layer
bit-planes and decoding the remainder. The x-axis shows the decoded bit-rate,
and the y-axis shows the PSNR of the resulting decoder reconstruction as
bit-planes are discarded. Also shown is the performance of the one-loop encoder
with no loss (top dotted line). This provides an upper bound on the performance
of the scalable coders.

In this figure, the curve labeled “FGS” uses the prediction strategy of the
MPEG-4 FGS coder, which is optimal if the available bandwidth is usually near
$R_{min}$. The curve labeled “one-layer with loss” uses a one-layer prediction
strategy, which is optimal if the available bandwidth is usually near
$R_{max}$. The curve labeled “drift-controlled” is our coder [?], optimized to
provide good performance across the range of rates.

The FGS coder performs poorly at the higher rates. The one-layer decoder
with drift suffers a 2.6-4.3 dB degradation at the lowest bit-rate, compared to
the drift-free FGS decoder. Relative to the FGS coder, our proposed coder
suffers about 1.3-1.4 dB performance degradation at the lowest bit-rate, but
significantly outperforms it elsewhere. Our coder loses some efficiency at the
highest rates compared to the one-layer coder, but has noticeably less drift as
bit-planes are discarded.

Fig. 3. Effect of Bandwidth on PSNR



Table 1 shows the PSNR averaged across different channel rates, assuming a
uniform distribution of rates between the smallest and the largest rate of the
one-loop encoder. The coder optimized for the range of channel rates
outperforms the other coders by 0.8-2.1 dB when there is only one I-frame.
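The averaging over a uniform distribution of channel rates can be sketched as follows. The sketch assumes the rate-PSNR curve is known at a few sample points and is piecewise linear in between (trapezoidal rule); the function name `mean_psnr_uniform` and the sample values are hypothetical and are not the data behind Table 1.

```python
def mean_psnr_uniform(rates, psnrs):
    """Average PSNR for a rate uniformly distributed over [rates[0], rates[-1]].

    The curve is taken as piecewise linear between the samples; rates must be
    strictly increasing.
    """
    area = 0.0
    for k in range(1, len(rates)):
        area += 0.5 * (psnrs[k] + psnrs[k - 1]) * (rates[k] - rates[k - 1])
    return area / (rates[-1] - rates[0])


# Illustrative curve: rates in Kbps, PSNR in dB.
rates = [100, 200, 400]
psnrs = [30.0, 34.0, 36.0]
avg = mean_psnr_uniform(rates, psnrs)
```

For these illustrative samples the average is 34 dB; comparing such averages across coders is one way to summarize which prediction strategy is preferable when the rate is unknown within a range.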



Table 1. PSNR assuming uniform distribution. 
3.3 Incorporating Loss Rate Knowledge 

Next, we consider the effect of incorporating knowledge of the expected loss rate 
into the video encoder design. We distinguish loss rate from lowered bandwidth 
because losses will randomly affect the entire frame for a given layer, while 
lowered bandwidth can be targeted specifically toward less important parts of 
the bit stream within a frame for a given layer. 

In Figure 3 we showed how the choice of prediction structure used by the 
layered encoder is influenced by the expected bandwidth. Similarly, the expected 
loss rate affects the best prediction structure. Figure 4 shows the performance 
of a two-layer H.263 encoder under random losses in each layer. Two different 
prediction strategies are used for the enhancement layer. The base layer is 
compressed at 64 Kbps, and each enhancement layer is compressed at 128 Kbps. 
The curve labeled “Enh 128” corresponds to an enhancement layer that is 
predicted only from the base layer, while the curve labeled “Enh 128p” corresponds 
to an enhancement layer that also uses prediction from previous enhancement-layer 
pictures. The results for “Enh 128” and “Enh 128p” assume that the base layer 
is completely received. 
For the base layer, performance degrades gradually as the loss rate increases 
from 0.1% to 1%, but as the loss rate continues to increase, the performance 
degrades significantly. Thus it will be important for the transport mechanism to 
keep the residual losses (after loss recovery) below 1% for the base layer. 

The more aggressive prediction strategy of “Enh 128p” has visually better 
performance for low loss rates, but for loss rates greater than about 5%, performs 
significantly worse than the less efficient prediction strategy of “Enh 128”. In 
both cases, performance with 100% loss of enhancement layer is identical to the 
base-only performance. If the more aggressive prediction strategy is used, then 
it will be important for the transport to keep the loss rate for the enhancement 
layer below 10%, while if the less efficient prediction strategy is used, the loss 
rate for the enhancement layer is less critical. However, performance will be 
significantly degraded in the case of few losses, or if additional layers are added 
beyond this first enhancement layer. 

3.4 Information Provided to Transport 

To enable the best design of an encoding-aware transport, the encoder should 
provide some meta-information to the transport mechanism. This meta-informat- 
ion includes two pieces of information: 

1. The bandwidth-quality trade-off (e.g., Figure 3). More specifically, the 
encoder conveys the bandwidth (or consumption rate) of each layer (i.e., C_0, 
C_1, ..., C_N), and the improvement in quality caused by each layer (i.e., Q_1, 
Q_2, ..., Q_N). 

2. The loss rate and quality trade-off (e.g., Figure 4). In some situations (such 
as for the base layer or the more aggressive enhancement-layer prediction 
strategy in Figure 4), the loss rate vs. quality meta-information can be 
reduced to simple thresholds regarding the maximum tolerable loss rate for 
each layer (i.e., L_max0, L_max1, ..., L_maxN). 




Fig. 4. Effect of Loss on PSNR 
In general, this meta-information will be encoding and even content specific. 
Ideally, it should also include temporal variations, indicating the trade-offs for 
different scenes within a sequence, or for each frame in a sequence. However, in 
practice, it might be necessary to use static information (for the entire sequence) 
or generic information (for a class of sequences, like “head-and-shoulders”). 

The interface between the transport and the encoder is completely character- 
ized by the loss rate and the bandwidth dedicated to video information. There- 
fore, the above information will be sufficient to design effective encoding-aware 
transport mechanisms. 



4 Encoding-Aware Transport 

In this section, we illustrate how a transport mechanism for layered encoded 
streams can leverage encoding-specific information to improve the quality of the 
delivered stream. First, we present our client-server architecture to identify the main 
components of the architecture and their associated design issues. Then, we 
explore the design space of the main components to show how encoding-specific 
information can be used to customize the design. Our goal is to clarify key 
tradeoffs in the design of a transport mechanism for layered video and to 
demonstrate how it can use information about encoded streams to improve delivered 
quality over the best-effort Internet. 

Figure 5 depicts our client-server architecture for delivery of Internet video 
playback, expanding the earlier overview figure with more details. As we mentioned 
earlier, the architecture has four key components. 1) Congestion Control (CC) is a 
network-specific mechanism that determines the available bandwidth (BW) and loss 
rate (L) of the network connection. Available bandwidth and loss rate are 
periodically reported to the Bandwidth Allocation (BA) module. 2) The BA module 
uses encoding-specific information to properly allocate the available bandwidth 
between the Loss Recovery (LR) and the Quality Adaptation (QA) modules. 3) The LR 
module uses its allocated portion of the available bandwidth (BW_lr) to retransmit 
the required ratio of recent losses. 4) The remaining portion of the available 
bandwidth (BW_qa) is used by the QA module to determine the quality of the 
delivered stream (i.e., the number of transmitted layers). 

All the streams are layered encoded and stored. Thus, different layers can be 
sent at different rates (bw_0, bw_1, ..., bw_n). The server multiplexes all active 
layers along with retransmitted packets into a single congestion-controlled flow. 
The client demultiplexes different layers and rebuilds individual layers in virtually 
separate buffers. Each layer’s buffer is drained by the decoder at a rate 
equal to its consumption rate (i.e., C_0, C_1, ..., C_n). The client reports its playout 
time in each ACK packet. This allows the server to estimate the client’s buffer 
state, i.e., the amount of buffered data for each layer. Client buffering is used 
to absorb short-term variations in bandwidth without changing the delivered 
quality. 

Fig. 5. Client-Server Architecture for Streaming of Layered Video 

The main goal of the transport mechanism is to map the actual connection 
bandwidth and loss rate into the range of acceptable channel behavior expected 
by the encoding mechanism. Layered encoding provides two degrees of freedom for 
the transport mechanism to achieve this goal: 1) by changing the number of layers, 
the transport mechanism can adjust the required channel bandwidth for delivery 
of the stream, and 2) by allocating a portion of the available channel bandwidth to 
loss repair, the transport mechanism can reduce the observed loss rate. Therefore, 
there are three key issues in the design of a transport mechanism for layered encoded 
streams that can be tailored for a given encoded stream to improve delivered 
quality: 

- Bandwidth Allocation strategy: How should the transport mechanism allocate 
the total connection bandwidth between the LR and QA modules? 

- Loss Repair strategy: How should the loss repair bandwidth (BW_lr) be shared 
among transmitting layers? 

- Quality Adaptation strategy: How should the server adjust the quality of the 
delivered stream (i.e., the number of layers) as the available channel bandwidth 
(BW_qa) changes? 

Since congestion control is a network-specific mechanism, its design should not be 
substantially affected by application requirements. Therefore, we do not discuss 
design issues of the CC mechanism. The main challenge is that the behavior of a 
network connection (BW and L) is not known a priori, and even worse it could 
substantially change during the course of a connection. Thus, the server should 
adaptively change its behavior as the connection behavior varies. 
For the rest of this section, we provide insight into each of the above three 
strategies in the design of a transport mechanism for layered encoded streams and 
demonstrate how the transport mechanism can benefit from encoding-specific 
information. We assume that the encoding-specific meta-data (described in 
Section 3.4) are available for each layered encoded stream. 



4.1 Bandwidth Allocation 

The BA module shifts the connection loss rate (L) into the range of acceptable 
loss rates by allocating the required amount of connection bandwidth for loss 
repair. Therefore, the application will observe a channel with a lower loss rate 
(L_qa) at the cost of lower channel bandwidth. We need to derive a function that 
presents the tradeoff between the channel bandwidth (BW_qa) and the channel 
loss rate (L_qa). Given the connection bandwidth (BW) and the connection loss 
rate (L), the total rate of delivered bits is equal to BW(1 - L). Therefore, the 
ratio of delivered bits for the channel is BW(1 - L) / BW_qa. Thus, we can calculate 
the channel loss rate as follows: 

$$L_{qa} = 1 - \frac{BW\,(1 - L)}{BW_{qa}}, \qquad BW\,(1 - L) \le BW_{qa} \le BW \qquad (1)$$

Equation (1) presents L_qa as a function of BW_qa for a given connection (i.e., 
BW, L). Figure 6 depicts this function for different sets of BW and L values. 
Each line in Figure 6 represents possible channel behaviors for a given network 
connection as the BA module trades BW_qa for L_qa. For example, point A 
represents a connection with 1000 Kbps bandwidth and 40% loss rate. To reduce 
the channel loss rate down to 33% (i.e., shifting point A to point B), the 
BA module should allocate 100 Kbps of connection bandwidth for loss repair, 
whereas reducing the loss rate down to 14% (i.e., shifting point A to point C) 
requires the BA module to allocate 300 Kbps of the connection bandwidth for 
loss repair. 
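
The numbers in this example can be checked with a few lines of Python. The sketch below is only an illustration of Equation (1); the function name `channel_loss_rate` and its argument names are our own and do not come from the paper.

```python
def channel_loss_rate(bw: float, loss: float, bw_lr: float) -> float:
    """Channel loss rate L_qa seen by the application, per Equation (1).

    bw    -- connection bandwidth BW (e.g., in Kbps)
    loss  -- connection loss rate L, as a fraction in [0, 1)
    bw_lr -- bandwidth set aside for loss repair, so BW_qa = BW - bw_lr
    """
    bw_qa = bw - bw_lr
    delivered = bw * (1.0 - loss)  # rate of bits that actually arrive
    # Equation (1) is only defined for BW(1 - L) <= BW_qa <= BW.
    if not (delivered <= bw_qa <= bw):
        raise ValueError("BW_qa must lie between BW(1 - L) and BW")
    return 1.0 - delivered / bw_qa


# Points A, B and C from the text: 1000 Kbps connection with 40% loss.
for repair_kbps in (0, 100, 300):
    rate = channel_loss_rate(1000, 0.40, repair_kbps)
    print(f"repair = {repair_kbps:3d} Kbps -> channel loss = {rate:.0%}")
```

Allocating all of the lost share (400 Kbps here) to repair drives the channel loss rate to zero, which is the lower end of the admissible range for BW_qa.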

Figure 6 clearly demonstrates how the BA strategy can be customized for 
a given encoded stream using the information provided by the encoder. Given 
the bandwidth of the various layers (i.e., C_0, C_1, ...) and the per-layer maximum 
tolerable loss rates (i.e., L_max0, L_max1, ...), to find the maximum number of 
layers (n) that can be delivered through a network connection (BW, L), the 
following two conditions should be satisfied: 

[condition (2), a bound on the channel loss rate L_qa, is illegible in the source] (2) 

$$\sum_{i=0}^{n} C_i \le BW_{qa} \qquad (3)$$

The first condition ensures that the average loss rate for n active layers is less 
than the channel loss rate, whereas the second condition ensures that channel 
bandwidth is sufficient for delivery of n layers. Given the values of BW, L and 
n = N (where N is the maximum number of layers), the BA module should use 
equation (1) to search for a channel loss rate that satisfies equation (2). If such a 
channel loss rate can be accommodated while the corresponding channel bandwidth 
satisfies equation (3), then n layers can be delivered and the total required 
bandwidth for loss repair is BW_lr = BW - BW_qa. Otherwise, the BA module 
decreases n by one and repeats this process. 
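
The search just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: because the loss-rate condition (2) is not legible in this copy, the tolerable channel loss rate for n layers is passed in as a caller-supplied function `tolerable_loss(n)`. The helper `weighted_threshold`, which uses a consumption-weighted mean of the per-layer thresholds, is one assumed form of that condition and is purely hypothetical. Here n counts the layers kept, starting from the base layer.

```python
from typing import Callable, Optional, Sequence, Tuple


def channel_bandwidth_limit(bw: float, loss: float, target_loss: float) -> float:
    """Largest BW_qa whose channel loss rate (Equation 1) is <= target_loss."""
    if target_loss >= 1.0:
        return bw
    # Equation (1) rearranged: BW_qa <= BW(1 - L) / (1 - target), capped at BW.
    return min(bw, bw * (1.0 - loss) / (1.0 - target_loss))


def allocate(
    bw: float,
    loss: float,
    consumption: Sequence[float],
    tolerable_loss: Callable[[int], float],
) -> Optional[Tuple[int, float, float]]:
    """Return (n, BW_qa, BW_lr) for the largest feasible n, or None."""
    for n in range(len(consumption), 0, -1):  # start at n = N, then decrease
        bw_qa = channel_bandwidth_limit(bw, loss, tolerable_loss(n))
        if sum(consumption[:n]) <= bw_qa:     # condition (3)
            return n, bw_qa, bw - bw_qa       # BW_lr = BW - BW_qa
    return None


def weighted_threshold(consumption: Sequence[float],
                       l_max: Sequence[float]) -> Callable[[int], float]:
    """ASSUMED stand-in for condition (2): weighted mean of L_max_i."""
    def tolerable(n: int) -> float:
        total = sum(consumption[:n])
        return sum(c * l for c, l in zip(consumption[:n], l_max[:n])) / total
    return tolerable


# Hypothetical stream: three layers (Kbps) with per-layer loss thresholds.
rates = [100, 200, 400]
limits = [0.01, 0.10, 0.20]
print(allocate(1000, 0.40, rates, weighted_threshold(rates, limits)))
```

The sketch picks the largest admissible BW_qa, which leaves the QA module as much bandwidth as possible while spending only the repair bandwidth that the loss condition requires.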

The BA module continuously monitors the connection behavior to determine 
the required bandwidth for loss repair such that the channel behavior always 
satisfies the conditions (equation 2 and 3) for the number of active layers. If 
the connection loss rate increases or the connection bandwidth decreases such 
that these conditions can no longer be satisfied, the BA module signals the QA 
module to drop the top layer. This decreases n which in turn presents a new set 
of conditions to the transport mechanism. 



Fig. 6. Trading the channel bandwidth with the channel loss rate 



In summary, the BA module determines the bandwidth share for the QA 
and LR modules. This allows us to separate the design of the loss recovery 
mechanism from the quality adaptation mechanism in spite of the fact that their 
collective performance determines delivered quality. 



4.2 Loss Recovery 

The loss repair module should micro-manage the total allocated bandwidth for 
loss repair (BW_lr) among the active layers such that the loss rate observed by 
each layer remains below its maximum tolerable threshold (i.e., L_max0, L_max1, 
..., L_maxN). Since all layers are multiplexed into a single unicast session at the 
server, the distribution of total loss rate across active layers is seemingly random 
and could change in time. Thus, the bandwidth requirement for loss recovery of 
various layers can randomly change in time even if the total loss rate remains 
constant. We assume a retransmission-based loss recovery since 1) retransmission 







R. Rejaie and A. Reibman 




Fig. 7. Sliding Window approach to prioritized Retransmission 



is feasible for playback applications with sufficient client buffering, 2) retransmission 
is more efficient (i.e., requires less bandwidth) than other repair schemes 
such as FEC, and 3) retransmission allows fine-grained bandwidth allocation 
among active layers.^ 

Since the importance of a layer monotonically decreases with its layer number, 
loss repair should be performed in a prioritized fashion, i.e., losses of layer j 
should be repaired before losses of higher layers and after losses of lower layers. 
However, a prioritized approach to loss repair should ensure that the total repair 
bandwidth is properly shared among active layers and that each retransmitted 
packet is delivered before its playout time. To achieve this, we deploy a sliding-window 
approach to loss repair as shown in Figure 7. At any point of time, the 
server examines a recent window of transmitted packets across all layers. Losses 
of active layers are retransmitted in the order of their importance such that the 
loss rate observed by each layer remains below its maximum tolerable loss rate 
(i.e., L_maxi). Figure 7 shows the order of retransmission within a window for 
each layer and across all layers. 

The repair window should always be a few round-trip-times (RTT) ahead 
of the playout time to provide sufficient time for retransmission. Therefore, the 
repair window slides with playout time. If the BA module properly estimates 
the required bandwidth for loss repair, all losses can be repaired. However, if the 
allocated bandwidth for loss repair is not sufficient to recover all the losses within 
a window, this approach repairs the maximum number of more important losses. 
The length of the repair window should be chosen properly. A short window 
cannot cope with a sudden decrease in bandwidth, whereas a long window could 
result in the late arrival of retransmitted packets for higher layers. In summary, 
the sliding window approach to prioritized loss repair 1) uses maximum per- 
layer tolerable loss rates for an encoded stream to improve its performance, and 
2) adaptively changes the distribution of total repair bandwidth among active 
layers. 
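
One way to picture the selection step inside a single repair window is the following Python sketch. It is an illustration under our own assumptions rather than the scheme's actual implementation: the `LostPacket` record, the packet-count budget (standing in for BW_lr over one window) and the rule of skipping packets that cannot arrive within one RTT of their playout time are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class LostPacket:
    layer: int           # 0 is the base layer (most important)
    seq: int             # sequence number within the layer
    playout_time: float  # when the client will consume it, in seconds


def select_retransmissions(
    lost: Sequence[LostPacket],
    sent_in_window: Dict[int, int],  # packets sent per layer in the window
    l_max: Dict[int, float],         # maximum tolerable loss rate per layer
    budget: int,                     # packets the repair bandwidth allows
    now: float,
    rtt: float,
) -> List[LostPacket]:
    """Pick losses to repair, lowest layer first, until the budget runs out."""
    chosen: List[LostPacket] = []
    for layer in sorted(sent_in_window):
        layer_losses = [p for p in lost if p.layer == layer]
        # Losses this layer can absorb while staying at or below L_max.
        tolerated = int(l_max[layer] * sent_in_window[layer] + 1e-9)
        needed = max(0, len(layer_losses) - tolerated)
        # Only packets that can still arrive before playout are worth sending;
        # the most urgent ones go first.
        repairable = sorted(
            (p for p in layer_losses if p.playout_time > now + rtt),
            key=lambda p: p.playout_time,
        )
        for packet in repairable[:needed]:
            if len(chosen) == budget:
                return chosen
            chosen.append(packet)
    return chosen
```

With an insufficient budget the loop stops inside the lowest layer it can no longer satisfy, so the repairs that are made always go to the more important layers first.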

^ Although we only discuss retransmission-based loss repair, the basic idea can be 
applied to other post-encoding loss repair mechanisms such as unequal FEC. 
4.3 Quality Adaptation 

The QA mechanism is a strategy that adds and drops layers to match the quality 
of the delivered stream (i.e., the number of transmitted layers) to the channel 
bandwidth (BW_qa). When the channel bandwidth is higher than the consumption 
rate of the active layers, the server can use the extra bandwidth and send 
active layers at a higher rate (i.e., bw_i > C_i) to fill up the client buffer. The 
buffered data at the client can be used to absorb a short-term decrease in bandwidth 
without dropping any layers. Figure 8 illustrates the filling and draining 
phases of the client buffers where three layers are delivered. If the total amount 
of buffered data during a draining phase is not sufficient to absorb the decrease 
in bandwidth, the QA module is forced to drop a layer. Consequently, the more 
data that is buffered at the client during a filling phase, the bigger the reductions 
that can be absorbed. The amount of buffered data at the client side is 
determined by the strategy of adding layers. 

Figure 9 compares two adding strategies. During a filling phase, the QA 
strategy adds a new layer after a specific amount of data is buffered. If the 
required amount of buffered data is small, the QA mechanism aggressively adds 
a new layer whenever the channel bandwidth slightly increases. In this case, any 
small decrease in bandwidth could result in dropping the top layer because of the 
small amount of buffering. Alternatively, the QA mechanism can conservatively 
add a new layer only after a large amount of data is buffered. 







Fig. 8. Filling and Draining phases for Quality adaptation 



More buffered data allows the server to maintain the newly added layer for 
a longer period of time despite major drops in bandwidth. Figure 10 shows 
an aggressive and a conservative adding strategy in action. The congestion-controlled 
bandwidth is shown with a sawtooth line in both graphs. This experiment 
clearly illustrates the coupling between the adding and dropping strategies. 
The more conservative the adding strategy, the longer a new layer can be kept, 
resulting in fewer quality changes, and vice versa. 

Fig. 9. Effect of adding strategy on quality changes 





Fig. 10. Aggressive vs Conservative QA Strategies 



The QA mechanism should be customized for the particular layered stream 
being transmitted. To explain this, we need to examine two basic tradeoffs in 
the design of an add and drop strategy. 

- How much data should be buffered before adding a new layer? 

This should be chosen such that normal oscillation in channel bandwidth in 
the steady state does not trigger either adding or dropping a layer. Figure 8 
clearly shows that the required amount of buffered data to survive a drop in 
bandwidth directly depends on the total consumption rate of the n active layers 
(i.e., $\sum_{i=0}^{n} C_i$). 

- How should the buffered data be distributed among active layers? 

Since the streams are layered encoded, buffered data should be properly 
distributed across all active layers in order to effectively absorb variations 
in bandwidth. During a draining phase, buffered data for layer i cannot 
be drained faster than its consumption rate (C_i). Therefore, buffered data 
for layer i cannot compensate for more than C_i bps. Figure 11 illustrates this 
restriction. To avoid dropping a layer, the total draining rate of the buffering 
layers (e.g., C_2 + C_1) should always be higher than the deficit in channel 
bandwidth (BW_def). More specifically, during a draining phase the following 
conditions must be satisfied: 

$$BW_{def} = \sum_{i=0}^{n} C_i - BW_{qa}, \qquad BW_{def} \le \sum_{i \in \mathrm{BufLayers}} C_i$$
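
The draining-phase conditions can be checked mechanically. The short Python sketch below is only an illustration of those two conditions; the function names and the convention that a layer counts as a buffering layer whenever it holds a positive amount of buffered data are our own assumptions.

```python
from typing import Sequence


def bandwidth_deficit(consumption: Sequence[float], bw_qa: float) -> float:
    """BW_def: how far the channel bandwidth falls short of total consumption."""
    return max(0.0, sum(consumption) - bw_qa)


def can_absorb_deficit(
    consumption: Sequence[float],  # C_i for each active layer
    buffered: Sequence[float],     # data currently buffered for each layer
    bw_qa: float,
) -> bool:
    """True if the buffering layers can drain fast enough to cover BW_def."""
    # Layer i can contribute at most C_i, and only while it has buffered data.
    drain_capacity = sum(c for c, b in zip(consumption, buffered) if b > 0)
    return bandwidth_deficit(consumption, bw_qa) <= drain_capacity


# Three 100 Kbps layers on a 150 Kbps channel: the deficit is 150 Kbps.
print(can_absorb_deficit([100, 100, 100], [50, 20, 0], 150))  # two layers drain
print(can_absorb_deficit([100, 100, 100], [50, 0, 0], 150))   # one layer drains
```

The second call fails because a single buffering layer can cover at most its own consumption rate, no matter how much data it holds, which is why the distribution of buffered data across layers matters.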



[Figure: amount of data drained from layers i, j, and k during a draining phase, plotted over time] 

Fig. 11. Sample distribution of buffered data among buffering layers 



5 Conclusion and Future Work 

In this paper, we presented a joint design of encoder and transport mecha- 
nisms for playback of layered quality-adaptive video over the Internet. The main 
challenge is that the Internet does not support QoS. At the time of encoding, 
the channel bandwidth and loss rate are not known and they could significantly 
change during the course of a session. Therefore, traditional encoding approaches 
that assume static channel behavior will result in poor quality. 

We argued that layered video is the most flexible solution because 1) it can 
be efficiently delivered over a range of channel behavior, and 2) it provides sufficient 
flexibility for the transport mechanism to effectively reshape the stream 
for delivery through the variable channel. However, to maximize the quality of delivered 
video, the encoding should become transport-aware and the transport 
mechanism should become encoding-aware. Toward this end, we described sev- 
eral issues in the design of layered encoding mechanisms and explained how 
the expected range of channel bandwidth and loss rate information can be in- 
corporated into the encoding mechanism. Furthermore, the encoding mechanism 
provides encoding-specific meta-information. The transport mechanism uses this 
information to bridge the gap between the expected range of channel behavior 
by the encoder and the actual connection behavior. More specifically, we pro- 
vided insight on how the main components of the transport mechanism, partic- 
ularly Bandwidth Allocation, Loss Repair and Quality Adaptation, can leverage 
encoding-specific meta-information to improve the delivered quality of layered 
video despite unpredictable changes in channel behavior. 

Finally, we plan to conduct extensive experiments over the Internet to eval- 
uate the overall performance of our client-server architecture. This will allow us 
to identify those scenarios in which our architecture cannot properly cope with 
changes in connection behavior. Some of these problems can be addressed by the 
encoding mechanism through appropriate provisioning, whereas others require 
further tuning or modification of the transport mechanism. Our experiments 
should provide deeper insight into channel behavior that may suggest refinement 
of the layered encoding mechanism. We also plan to examine interactions 
among the three key components of the transport mechanism, as well as the implications 
of the congestion control algorithm for the other components of the transport mechanism. 
Abstract. In this paper a multi-state rate adaptation scheme for rate adaptable 
continuous media (CM) sources is presented; the network bandwidth availability 
information is assumed to be communicated to the source only indirectly, 
through a notification of the packet losses. The developed scheme is capable of 
detecting and responding well to static (due to access bandwidth limitations) 
and long-term (due to initiation/termination of CM flows) bandwidth availability, 
as well as locking the rate of a CM flow to an appropriate value. Its performance 
is compared with that under a classic Additive Increase Multiplicative 
Decrease algorithm and it is shown to perform better in terms of a number of 
relevant metrics, such as packet losses, oscillatory behavior, and fairness. 



1. Introduction and Background 

The network layer of the Internet currently provides no inherent congestion control. In 
current IP environments, there is no central entity to dictate an explicit rate for the 
flows or explicitly indicate the current network state like, for example, in an ATM 
environment; ECN (Explicit Congestion Notification) based schemes are still far from 
being adopted and widely deployed [1,2]. Thus, congestion control may be provided 
through mechanisms implemented either by the transport or the application layers. A 
flow source would use its own application or transport level information concerning 
packet losses to infer the state of the network, resulting in a distributed rate control 
scheme. Different users may infer a different network state. The Transmission Control 
Protocol (TCP) provides congestion control, whereas the User Datagram 
Protocol (UDP) does not. 

One of the emerging services appearing in the Internet is the continuous media 
(CM) streaming service, such as video streaming, conferencing and “on demand” 
services. CM applications require more bandwidth, compared to the typical data 
block transfer services (e.g., Web access over TCP, file transfer via FTP, e-mail), and 
are sensitive to the network induced delay, delay jitter and packet loss. 

TCP is not suitable for multimedia transmission, because of its window-based 
congestion scheme [3,4,5]. For CM services UDP is used as the base transport protocol, 
in conjunction with the Real-time Transport Protocol (RTP) and the RTP Control Protocol 
(RTCP) [6]. RTP provides end-to-end delivery services for real-time data, such as 
interactive audio and video. Those services include payload type identification, 
sequence numbering, time-stamping and delivery monitoring. RTP itself does not 
provide any mechanism to ensure the timely delivery of media units, nor does it provide 
other quality-of-service guarantees; it relies on lower-layer services to do so 
(i.e., integrated or differentiated services applied at the network layer). RTP carries 
data with real-time properties, whereas RTCP monitors the quality of service of each 
participant and conveys useful information to all participating entities. Applications 
that use the UDP transport protocol have to implement end-to-end congestion control 
[7]; otherwise (a) the TCP traffic suffers and (b) the network collapses. RTCP may be 
used as a feedback channel to provide for congestion notification to CM applications 
that implement an end-to-end congestion control scheme [8,9]. Examples of 
RTP/RTCP based rate control algorithms, that are TCP-friendly, are [10-14]. Other 
rate control protocols, such as RAP [15] and TFRCP [16], that are TCP-friendly and 
do not rely on RTP/RTCP, have been proposed in the literature as well. 

Rate control protocols and adaptation mechanisms require sources capable of ad- 
justing their rate on the fly (such as MPEG-4 [17]) or transcoding filters and video 
gateways [18,19,20] capable of transcoding a high bit-rate input stream (such as 
MPEG-2 [21]) to a lower bit-rate adjustable output stream. An alternative to rate 
control and transcoding mechanisms is the multi-layered coding with multicast trans- 
fer scheme [22]. Such a scheme allows the receiver to connect to the appropriate 
multicast channel(s) according to its connection bandwidth availability. 

A rate adaptation scheme needs to learn the network state through a 
feedback mechanism. Most rate adaptation schemes use the current packet loss rate to 
infer whether the network is congested or not, and the rate is either decreased or 
increased, respectively. The increase function is either additive increase [23] or 
multiplicative increase [16], whereas the decrease function is multiplicative decrease 
[8,9,15] or the rate may be set to a value calculated by an equation-based formula 
[24] which estimates the equivalent TCP throughput [10-14,16]. The various feed- 
back mechanisms are either RTCP-based [8-14] or ACK-based [15,16]. In the first 
case the feedback delay is at least 5 sec (± 1.5 randomly determined), whereas in the 
second case it is equal to one round trip time. A congestion control scheme that uses 
the history of the packet losses has been proposed as well [25]. 
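
As a point of reference for the loss-driven schemes surveyed above, a single additive-increase, multiplicative-decrease update can be sketched in Python as follows. This is a generic illustration of the classic approach, not the multi-state scheme proposed in this paper, and every numeric default (loss threshold, increase step, decrease factor, rate bounds) is a hypothetical value chosen only for the example.

```python
def aimd_step(
    rate: float,                   # current sending rate, e.g. in Kbps
    loss_rate: float,              # packet loss fraction from the last report
    *,
    loss_threshold: float = 0.02,  # above this, the network counts as congested
    increase: float = 50.0,        # additive step when not congested
    decrease: float = 0.875,       # multiplicative factor when congested
    r_min: float = 64.0,           # minimum rate for acceptable quality
    r_max: float = 1000.0,         # encoder or policy upper limit
) -> float:
    """Return the next sending rate after one feedback report."""
    if loss_rate > loss_threshold:
        new_rate = rate * decrease  # congested: back off multiplicatively
    else:
        new_rate = rate + increase  # not congested: probe additively
    # Keep the rate inside the range the source encoder supports.
    return min(r_max, max(r_min, new_rate))


rate = 500.0
for reported_loss in (0.0, 0.0, 0.10, 0.0):
    rate = aimd_step(rate, reported_loss)
    print(f"loss = {reported_loss:.0%} -> rate = {rate:.1f} Kbps")
```

With RTCP-based feedback each such step would run roughly once per report interval, which is why the size of the increase and decrease steps has a large effect on how quickly the rate converges.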

In this paper a rate adaptation scheme is introduced for an environment supporting 
CM flows. As discussed briefly below - as well as throughout this paper - the scheme 
will attempt to take into consideration certain peculiarities associated with a CM- 
supporting networking environment. 

Users may be connected to the network using links of limited and fixed bandwidth 
(e.g., dial-up users, ADSL users), or practically unlimited bandwidth. The bit-rate of a 
CM flow is assumed to take any value within a specific range, where a minimum rate 
is considered in order to ensure a minimum perceptual quality, whereas a maximum 
rate represents the upper limit of the source encoder capability or is imposed to prevent 
unnecessarily high consumption of bandwidth and/or prevent the suppression of 
other non CM flows, e.g., TCP flows. The (limited) access bandwidth and the mini- 
mum and maximum source encoding rates all impose restrictions on the behavior of 
the source rate adaptation scheme and should be taken into consideration. For in- 
stance, probing for more bandwidth availability when the (static) access bandwidth 
limit is reached should be avoided by an effective rate adaptation scheme. 

CM flow initiations and terminations could result in large and lasting bandwidth 
availability changes. Such changes should be distinguished from (small and 
short-term) typical bandwidth fluctuations and should trigger a rapid response from 
the rate adaptable CM flows to avoid excessive losses or bandwidth under-utilization. 
Because of the nature of the CM applications, decreasing the rate upon congestion by 
a factor of 2 (as in TCP) should be avoided, as the perceptual QoS would suffer sub- 
stantially. Fast convergence to a fair bandwidth allocation is desirable after a flow 
initiation or termination occurs. 

The rate adaptation scheme proposed in this paper is designed for a networking 
environment supporting CM flows and is capable of detecting and responding effec- 
tively to static and long-term bandwidth availability. The CM flows are supposed not 
to be aware of any access bandwidth limitations or the number of co-existing CM 
flows or the overall network load. The developed scheme detects any access band- 
width limitations and locks the rate of the flow to an appropriate rate, as well as de- 
tects lasting load changes (due to the initiation or termination of CM flows) and locks 
the rate of the CM flows to an appropriate rate. Short-term bandwidth changes are 
not expected to be present in a (purely) CM supporting networking environment due 
to the nature of the Adaptable Constant Bit Rate (A-CBR) CM applications consid- 
ered here. While an extension of this work will consider an environment where A- 
CBR CM flows co-exist with other traffic (such as TCP) which introduces short-term 
bandwidth changes, the effort there is expected to be focused on the study of the 
impact of such bursty traffic on the proposed rate adaptation algorithm and the intro- 
duction of the adjustments to improve its effectiveness, rather than the redesign of the 
algorithm to become responsive to such short term bandwidth fluctuations. To follow 
the latter fluctuations may be impossible due to the granularity and the hysterisis of 
the encoding process and the rate adaptation time lag, as well as be of negligible im- 
pact on the perceived QoS. 

In this paper, a multi-state congestion control scheme for rate adaptable unicast 
CM flows is presented. This scheme is suitable for an environment in which the net- 
work bandwidth availability information is communicated to the source only indi- 
rectly through a notification of the packet losses that the specific source suffers over a 
time interval. This environment is present, for example, when CM flows are transmit- 
ted by using RTP/UDP/IP protocols. The proposed scheme aims to provide a finer 
rate adaptation decision space for the source by introducing a number of flow states 
and then basing the rate adaptation decision on the current source state in addition to 
the current packet loss feedback information. The introduction of states allows for an 
effective “summarization” of the recent bandwidth availability and adaptation history. 
The resulting multi-state rate adaptation algorithm has a larger decision space 
compared to traditional schemes, and responds well to a diversity of network situations, 
such as those outlined above, while exhibiting good fairness characteristics. The 
behavior of the proposed scheme is examined in an environment where only A-CBR 
CM (RTP/UDP/IP) flows are transmitted (i.e., free of TCP flows), as may be the 
case under the forthcoming differentiated services. 



2. Description of the Algorithm 

A networking environment supporting N RTP/UDP/IP CM flows is considered. The
RTCP protocol is used as a feedback notification mechanism as specified by RFC
1889 [7]. The source and receiver(s) of each flow exchange RTCP Sender and Receiver
Reports (SR and RR, respectively). The sources are assumed to be capable of
adapting their encoding and transmission rates on the fly. The source
of a flow is not aware of the topology, the links, the available bandwidth, or the
current number of CM flows transmitted over the network. The only available
information is that carried by the RTCP RR reports. One RR report is sent every T
seconds. In this paper T is set equal to 5 sec, as specified by RFC 1889.

A decision module, located at the source, is responsible for deciding on the
appropriate transmission rate to which the source should adapt. Rate adaptation
decisions are taken at the adaptation points. An adaptation point occurs when an RR
arrives at the source of the flow or when a time-out timer expires, indicating that at
least one RR report is excessively delayed or lost. For flow i, the time-out period
timeout_period(i) is initially set to a value $T_0(i)$ greater than T,
to absorb (in part) the network-introduced delay jitter. The timer is reset to $T_0(i)$
each time an RR is received and to $T_1(i)$, with $T < T_1(i) < T_0(i)$, each time a time-out
occurs. This reduction of the time-out period shortens the time to the next
adaptation point when that point is again due to a time-out and, thus, leads to
a faster and more effective rate adaptation.
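
The timer reset rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name and the numeric values chosen for T0 and T1 are assumptions made only for the example, since the text requires only that T < T1 < T0.

```python
from dataclasses import dataclass


@dataclass
class TimeoutTimer:
    """Time-out period of one flow, following the reset rule in the text."""
    T: float = 5.0    # RR reporting interval in seconds (value used in the paper)
    T0: float = 7.0   # hypothetical value; only T < T1 < T0 is required
    T1: float = 6.0   # hypothetical value
    period: float = 0.0

    def __post_init__(self):
        if not (self.T < self.T1 < self.T0):
            raise ValueError("the text requires T < T1 < T0")
        self.period = self.T0  # initial time-out period

    def on_rr_received(self) -> float:
        """An RR arrived: restore the longer period T0."""
        self.period = self.T0
        return self.period

    def on_timeout(self) -> float:
        """The timer expired: use the shorter period T1 for the next wait."""
        self.period = self.T1
        return self.period
```

With these rules, a run of consecutive time-outs produces adaptation points spaced T1 apart rather than T0 apart.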



Let $f^k(i)$ denote the network state feedback associated with the $k$-th adaptation point
of flow i: $f^k(i) = \emptyset$ (i.e., no value) if the $k$-th adaptation point is due to a timer
expiration, and $f^k(i) = p^k(i)$ when it is due to the arrival of an RR, where $p^k(i)$ denotes
the packet loss rate derived from the RR received at the $k$-th adaptation point. Let
$S^k(i)$ and $r^k(i)$ denote the state (to be defined later) and the rate of flow i just before
the $k$-th adaptation point. At the $k$-th adaptation point a state transition (adaptation)
from $S^k(i)$ to $S^{k+1}(i)$ and a rate transition from $r^k(i)$ to $r^{k+1}(i)$ occur, effective just
after the $k$-th adaptation point and remaining effective until just before the $(k+1)$-th
(Fig. 1). In addition to the received feedback $f^k(i)$, the next state $S^{k+1}(i)$ depends on
$S^k(i)$ as well as on a number of other quantities defined at each adaptation point, which
provide useful history information. Such quantities are the recent average packet loss
rate $p_{av}^k(i)$ (defined precisely later), the numbers of successive and of total visits to
the current state $S^k(i)$ (for certain states), and the binary lock decision function
$B_{lock}(i)$, which returns 1 if rate locking is decided. The next rate $r^{k+1}(i)$ depends on
the next state $S^{k+1}(i)$ decided to be visited, the current rate $r^k(i)$, and the feedback
$f^k(i)$.

Fig. 1. State and rate transitions.



Let $\mathcal{S}$ denote the state space of the process $\{S^k(i)\}_{k \ge 0}$: $\mathcal{S}$ = {just_intro,
decr_due_nofeedback, increase, decrease, lock, rmax, rmin,
fast_increase, fast_decrease, check_for_bw, low_increase,
low_decrease, low_lock, low_check_for_bw}. A brief description of
the states and the motivation behind their introduction are presented next.

Upon initiation, a flow enters state just_intro, allowing the new source to follow
a slightly different increase/decrease policy for a given feedback compared to
existing flows. This is necessary since the initial rate of a flow is randomly selected
within a range, and thus different levels of up or down adjustment are needed
compared to older flows, which have already adjusted their rate and need to respond to
bandwidth availability changes.

If a rate adaptation point is due to the time-out timer expiration, it is considered that
at least one feedback report is lost or excessively delayed, and the decr_due_nofeedback
state is visited while the rate is properly decreased.

States rmax and rmin are defined to capture the state of a source that attempted
to select a rate beyond the corresponding rate limits. When a flow is in one of these
states, the source applies a control such that the flow rate remains in the range
[$r_{min}(i)$, $r_{max}(i)$]. Also, visiting the state rmin once is an indication of a highly
congested network, whereas visiting this state several times is an indication of
persisting congestion, probably due to an access network limitation.

States increase and decrease are defined for the normal case in which the 
flows compete with each other in order to share the available bandwidth. The flow 
enters state increase in order to increase its rate, whereas it enters state de- 
crease to decrease the rate in response to detecting some minor congestion. 

State fast_increase is for the case that the algorithm estimates that the network
is rather seriously under-utilised, possibly due to bandwidth release caused by the
termination of one or more CM flows. The transition to this state is accompanied by a
rate increase that is considerably greater than that of state increase. State
fast_decrease is for the case that the algorithm detects rather serious network
congestion due to, for instance, the initiation of some new CM flow(s), and results in a
rate decrease that is more drastic than that of state decrease.

The algorithm is capable of detecting situations where a flow oscillates around
the fairness point or around its access bandwidth limit and can optionally lock the
rate. The state lock is defined for this purpose. As long as the flow is in state lock
the rate is maintained. After spending some time in state lock, the state
check_for_bw is visited and a bandwidth-probing rate increase takes place every
time this state is (re)visited.

States low_increase, low_decrease, low_lock, and
low_check_for_bw are defined to allow for a milder reaction, which is more
appropriate near the $r_{min}(i)$ value, where it is highly likely that the congestion is
due to access network limitations.

Under no feedback ($f^k(i) = \emptyset$), that is, when the time-out timer expires, the state
decr_due_nofeedback is visited from any state and the rate is decreased as
described in section 3. If the number of consecutive visits to this state exceeds a
threshold (max_cons_times_in_decrnofeed(i)), then the state rmin is visited.



2.1 State Transitions under Zero Packet Loss Rate Feedback ($f^k(i) = p^k(i) = 0$)

Under zero-Packet Loss Rate (zero-PLR) feedback the state transitions shown in 
Table 1 take place under the conditions specified. Such transitions are typically fol- 
lowed by a rate increase, as determined by the state visited and described in section 3. 
Some of the state transitions are self-justifiable in view of the comments provided for 
the various states in the previous section. 

When a transition to a state is accompanied by a rate increase that results in a rate
exceeding $r_{max}(i)$, the rate is set to $r_{max}(i)$ and an instantaneous transition to
state rmax occurs. State rmax may be visited from any state, and transitions to rmax
are not shown in Table 1 for simplicity.

Since the initial rate of a new flow (entering state just_intro) is randomly
selected, a zero-PLR feedback implies that the selection has most likely been very
conservative and, thus, a transition to state fast_increase (as opposed to
increase or low_increase) occurs.

Since state decr_due_nofeedback is visited upon time-out timer expiration
(indicating delayed or lost feedback) and the rate is decreased, a zero-PLR feedback
indicates that the previous rate decreases that occurred (upon visiting this state) have
been rather excessive, and a transition to state fast_increase occurs to allow for a
fast rate “correction” (increase).

Upon zero-PLR feedback the algorithm (process) remains in (makes a transition
to) state increase, unless this type of transition has occurred for a number (equal
to go_for_fast_increase(i)) of consecutive times. When the latter occurs it is inferred
that a faster increase rate is needed and a transition to state fast_increase
occurs. go_for_fast_increase(i) is a function that depends on the current rate r(i): the
larger r(i), the larger its value, implying that a lower-rate flow enters state
fast_increase sooner than a higher-rate one.

In order to maintain some smoothness in the process and avoid frequent rate
oscillations, the state increase (as opposed to fast_increase) is visited from state
fast_decrease upon zero-PLR feedback.

Upon zero-PLR feedback the process moves from state decrease to increase
(self-explanatory) unless a decision to lock the rate is taken (see section 4), in which
case the state lock is visited.

Upon zero-PLR feedback the process remains in state lock, unless this type of
transition has occurred for a number (equal to go_for_check_bw(i)) of
consecutive times, in which case the state check_for_bw is visited.



Table 1. State transitions under the zero-PLR feedback (except rmax). 



Current state      Next state         Condition
-----------------  -----------------  ------------------------------------------------
just_intro         fast_increase      always
decr_due_nofeed.   fast_increase      always
rmax               rmax               always
rmin               increase           total visits to rmin < times_in_rmin(i), e.g., 3
                   low_increase       total visits to rmin ≥ times_in_rmin(i)
increase           increase           successive visits < go_for_fast_increase(i)
                   fast_increase      successive visits = go_for_fast_increase(i)
fast_increase      fast_increase      always
fast_decrease      increase           always
decrease           increase           lock decision function returns 0
                   lock               lock decision function returns 1
lock               check_for_bw       successive visits = go_for_check_bw(i), e.g., 8
                   lock               otherwise
check_for_bw       check_for_bw       successive visits < go_to_unlock(i), e.g., 4
                   increase           otherwise
low_decrease       low_increase       always
low_increase       increase           visits > escape_low_states(i), e.g., 20
                   low_increase       otherwise
low_lock           low_check_for_bw   successive visits > go_for_low_check_bw(i)
                   low_lock           otherwise
low_check_for_bw   low_increase       successive visits > go_to_low_unlock(i), e.g., 6
                   low_check_for_bw   otherwise



Upon zero-PLR feedback the process remains in state check_for_bw, unless
this type of transition has occurred for a number (equal to go_to_unlock(i)) of
consecutive times, in which case the state increase is visited and the flow is
unlocked.

Upon zero-PLR feedback the process moves from state rmin to state low_increase or
increase, depending on the total number of visits to rmin since the initiation of the
flow. If this number is equal to or greater (less) than times_in_rmin(i), then
state low_increase (increase) is visited. Visiting rmin a number of times
implies that the available bandwidth is likely only slightly greater than
$r_{min}(i)$ and that the lower rate increase associated with state
low_increase is therefore more appropriate.

The collection of states with the prefix low forms a subspace (referred to as the
low-state subspace) which is entered by the process only from state rmin under zero-PLR
feedback, provided that the times_in_rmin(i) threshold is reached. While in
the low-state subspace, the process evolves under zero-PLR feedback in a manner similar
to that described earlier, with minor decision threshold (parameter) adjustments
as captured by the introduced parameters (“history” counters). The only exit point
from this subspace is from state low_increase under zero-PLR feedback,
provided that this state has been visited a number (equal to
escape_low_states(i)) of times, in which case there appears to be room for a
higher rate increase and the state increase is visited.
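
The zero-PLR transitions of Table 1 can be summarised as a lookup function. The sketch below is only an illustration under stated assumptions: the function and argument names (`succ`, `total_rmin`, `lock_decided`) are hypothetical, the threshold defaults are the example values quoted in the paper, comparisons against thresholds are written with inequalities where the text says "equal to", and the instantaneous move to rmax on rate overflow is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Example values quoted in the paper; a deployment may choose others.
    times_in_rmin: int = 3
    go_for_check_bw: int = 8
    go_to_unlock: int = 4
    go_for_low_check_bw: int = 10
    go_to_low_unlock: int = 6
    escape_low_states: int = 20


def next_state_zero_plr(state, succ, total_rmin, lock_decided,
                        go_for_fast_increase, th=Thresholds()):
    """Next state under zero-PLR feedback.

    succ                 -- successive visits to the current state (assumed counter)
    total_rmin           -- total visits to rmin since the flow started
    lock_decided         -- True if the lock decision function returns 1
    go_for_fast_increase -- rate-dependent threshold for leaving state increase
    """
    if state in ("just_intro", "decr_due_nofeedback", "fast_increase"):
        return "fast_increase"
    if state == "rmax":
        return "rmax"
    if state == "rmin":
        return "low_increase" if total_rmin >= th.times_in_rmin else "increase"
    if state == "increase":
        return "fast_increase" if succ >= go_for_fast_increase else "increase"
    if state == "fast_decrease":
        return "increase"
    if state == "decrease":
        return "lock" if lock_decided else "increase"
    if state == "lock":
        return "check_for_bw" if succ >= th.go_for_check_bw else "lock"
    if state == "check_for_bw":
        return "check_for_bw" if succ < th.go_to_unlock else "increase"
    if state == "low_decrease":
        return "low_increase"
    if state == "low_increase":
        return "increase" if succ > th.escape_low_states else "low_increase"
    if state == "low_lock":
        return "low_check_for_bw" if succ > th.go_for_low_check_bw else "low_lock"
    if state == "low_check_for_bw":
        return "low_increase" if succ > th.go_to_low_unlock else "low_check_for_bw"
    raise ValueError(f"unknown state: {state}")
```

Keeping the table as straight-line conditionals makes each row easy to check against Table 1; a dictionary of rules would work equally well.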



2.2 State Transitions under Nonzero Packet Loss Rate Feedback ($f^k(i) = p^k(i) > 0$)

Under nonzero Packet Loss Rate (nonzero-PLR) feedback the state transitions shown
in Table 2 take place under the specified conditions. Such transitions are followed by
a rate decrease, as determined by the state visited and described in section 3. Some of
the state transitions are self-justifiable in view of the comments provided for the
various states earlier in section 2. When a transition to a state is accompanied by a
rate decrease that results in a rate below $r_{min}(i)$, the rate is set to $r_{min}(i)$ and an
instantaneous transition to state rmin occurs. State rmin may be visited from any state,
and transitions to rmin are not shown in Table 2 for simplicity.

It is highly desirable to rapidly detect the introduction of a new CM flow in the
network and adapt the flows’ rates in order to avoid excessive packet losses and
provide bandwidth to the new flow. The initial rate of the new flow may be too high for a
loaded network. In this case the flows will experience packet losses due to the
congestion introduced by the new flow. In order to distinguish whether the congestion is
due to the initiation of a new flow or not, the reported packet loss rate $p^k(i)$ is
compared to a recent average packet loss rate. Let $p_{av}^{k-1}(i)$ denote this average, and
let $\mathit{diff}^k(i)$ denote the difference between $p^k(i)$ and $p_{av}^{k-1}(i)$:

$$\mathit{diff}^k(i) = p^k(i) - p_{av}^{k-1}(i)$$

If $\mathit{diff}^k(i)$ is greater than $P_{new\_flow}(i)$, then it is considered that the current packet
losses ($p^k(i)$) are significantly higher than the recent average losses ($p_{av}^{k-1}(i)$); they are
attributed to the introduction of a new flow and state fast_decrease is visited.
Otherwise, it is considered that packet losses occur due to the rate increases of the CM
flows and the state decrease is visited. The value used for $P_{new\_flow}(i)$ is 0.04.
When packet losses occur ($p^k(i) > 0$), the recent average packet loss rate is
updated as follows:

$$p_{av}^k(i) = c_1 \, p_{av}^{k-1}(i) + (1 - c_1) \, p^k(i), \quad c_1 = 0.875$$
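
A small sketch of this test and of the running-average update may help. It covers only the choice between decrease and fast_decrease for the ordinary (non-lock, non-low) states; the function name and the returned tuple are assumptions made for the example, while the constants 0.875 and 0.04 are the values given in the text.

```python
C1 = 0.875         # smoothing weight of the recent average loss rate
P_NEW_FLOW = 0.04  # threshold above which a new flow is suspected


def classify_loss(p_k, p_av_prev, p_new_flow=P_NEW_FLOW, c1=C1):
    """Return (next_state, updated_average) for a nonzero loss report p_k.

    p_k       -- packet loss rate reported by the current RR (fraction, > 0)
    p_av_prev -- recent average loss rate before this report
    """
    if p_k <= 0:
        raise ValueError("the update applies only when losses occurred")
    diff = p_k - p_av_prev
    next_state = "fast_decrease" if diff > p_new_flow else "decrease"
    p_av = c1 * p_av_prev + (1 - c1) * p_k
    return next_state, p_av
```

For instance, a report of 10% loss against a recent average of 1% exceeds the 0.04 threshold and selects fast_decrease, whereas a report of 2% selects decrease.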

Upon nonzero-PLR feedback the process moves from states just_intro,
decr_due_nofeedback, increase, decrease, fast_increase,
fast_decrease and rmax to state decrease if $\mathit{diff}^k(i) \le P_{new\_flow}(i)$, or to
state fast_decrease otherwise. When in states lock or check_for_bw the
process maintains its state under nonzero-PLR feedback unless $\mathit{diff}^k(i) > P_{new\_flow}(i)$, in
which case the state fast_decrease is visited.

Under nonzero-PLR feedback the process moves from state low_increase to state
low_decrease, unless a decision to lock the rate is taken (the lock decision function
returns 1), in which case the state low_lock is visited. If $\mathit{diff}^k(i) \le P_{new\_flow}(i)$, the process
remains in state low_lock (no rate change) under nonzero-PLR feedback unless
this type of transition has occurred for a number (equal to
cancel_low_locking(i)) of consecutive times, in which case a wrong locking rate is
inferred and the state low_decrease is visited (rate unlocking). If $\mathit{diff}^k(i) >
P_{new\_flow}(i)$, then the process moves from state low_lock to state
low_decrease. Finally, under nonzero-PLR feedback the process moves from
state low_check_for_bw either to low_lock, if $\mathit{diff}^k(i) \le P_{new\_flow}(i)$, or to
state low_decrease otherwise.



Table 2. State transitions under the nonzero-PLR feedback (except rmin). 



Current state                    Next state      Condition
-------------------------------  --------------  ------------------------------------------
just_intro, decr_due_nofeed.,    decrease        diff(i) ≤ P_new_flow(i)
increase, rmax, decrease,        fast_decrease   diff(i) > P_new_flow(i)
fast_increase, fast_decrease
lock, check_for_bw               lock            diff(i) ≤ P_new_flow(i)
                                 fast_decrease   diff(i) > P_new_flow(i)
low_decrease                     low_decrease    always
low_increase                     low_lock        lock decision function returns 1
                                 low_decrease    otherwise
low_lock                         low_lock        diff(i) ≤ P_new_flow(i) AND
                                                 successive visits > cancel_low_locking(i)
                                 low_decrease    otherwise
low_check_for_bw                 low_lock        diff(i) ≤ P_new_flow(i)
                                 low_decrease    otherwise
rmin                             rmin            always



3. The Distance Weighted Additive Increase, Loss Rate Based 
Multiplicative Decrease Scheme 

The proposed multi-state rate adaptation algorithm belongs to the class of Additive
Increase Multiplicative Decrease (AI-MD) schemes [23]. In accordance with the AI-MD
characteristic of the proposed scheme, the new rate $r^{k+1}(i)$ is given in terms of the
previous rate $r^k(i)$ by:

$$r^{k+1}(i) = r^k(i) + \alpha(i), \quad \text{for the increase case}$$

$$r^{k+1}(i) = r^k(i) \cdot \beta(i), \quad \text{for the decrease case}$$

It should be noted that the increase step $\alpha(i)$ and the decrease factor $\beta(i)$ are not
fixed, as they are in the classical AI-MD case: the former depends on the state visited
$S^{k+1}(i)$, $r^k(i)$, $r_{max}(i)$ and $r_{min}(i)$, and the latter depends on the state visited $S^{k+1}(i)$,
$r_{min}(i)$ and $p^k(i)$. To capture the key fact that the increase step depends on the
distance between $r^k(i)$ and $r_{max}(i)$ and that the decrease factor is shaped by the
received packet losses, the proposed scheme is referred to as the Distance-weighted
Additive Increase, Loss rate-based Multiplicative Decrease (D.AI-L.MD) rate
adaptation scheme.

(i). If the visited state is increase or check_for_bw, the
increase step $\alpha(i)$ has two attributes: the base increase rate Incr(i) in kbps and the
rate distance factor dist_weight(r(i)). That is,

$$\alpha(i) = \mathit{dist\_weight}(r(i)) \cdot \mathit{Incr}(i),$$

$$\mathit{dist\_weight}(r(i)) = \frac{r_{max}(i) - r(i)}{r_{max}(i) - r_{min}(i)}$$

Note that dist_weight(r(i)) expresses the distance of the current rate from $r_{max}(i)$. If
$r(i) \to r_{min}(i)$ then $\mathit{dist\_weight}(r(i)) \to 1$, whereas if $r(i) \to r_{max}(i)$ then
$\mathit{dist\_weight}(r(i)) \to 0$. If flows i and j are such that $r_{max}(i) = r_{max}(j)$,
$r_{min}(i) = r_{min}(j)$ and $r(i) > r(j)$, then dist_weight(r(i)) < dist_weight(r(j)). This
means that flows with a lower rate increase at a faster pace than flows with a higher
rate; therefore the convergence time to fairness is expected to improve.

(ii). If the visited state is fast_increase, the increase step $\alpha(i)$ also has a
third attribute, denoted by d(i). That is,

$$\alpha(i) = d(i) \cdot \mathit{dist\_weight}(r(i)) \cdot \mathit{Incr}(i)$$

When state fast_increase is visited from another state, $d(i) = d_{init}(i) = 5$; d(i)
is decreased by one at each adaptation point at which the flow remains in state
fast_increase. The reason for this decrease is that each time a fast increase
occurs, the probability of having packet losses in the next time interval is expected to
increase as well. Thus, the next fast increase should be more conservative. When d(i)
reaches zero it is reset to the value $d_{init}(i)$ plus the number of successive visits to
state fast_increase. The reason is that successive visits to state
fast_increase indicate that there is most likely available bandwidth for the
flow. Since dist_weight() reduces the increase step $\alpha(i)$ as the rate increases, d(i) is
increased in order to compensate for this reduction after a large number of successive
visits to state fast_increase.

(iii). If the visited state is low_increase or low_check_for_bw, then
the rate is increased by a constant step. That is,

$$\alpha(i) = \mathit{incr\_for\_low}(i)$$

A typical value of incr_for_low(i) is 1 kbps. Table 3 summarizes the
different increase step functions $\alpha(i)$.
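
The three increase rules can be collected into one function, as sketched below. The function names and signature are illustrative assumptions; the default values (56 kbps, 1.2 Mbps, 30 kbps, d = 5, 1 kbps) are those used in the simulation section of the paper, and the bookkeeping that decrements and resets d(i) is deliberately left to the caller.

```python
def dist_weight(r, r_min, r_max):
    """Normalised distance of the current rate r from r_max (1 at r_min, 0 at r_max)."""
    return (r_max - r) / (r_max - r_min)


def increase_step(state, r, r_min=56.0, r_max=1200.0, incr=30.0,
                  d=5, incr_for_low=1.0):
    """Additive increase step in kbps for the visited state.

    r, r_min, r_max, incr and incr_for_low are in kbps; d is the current
    value of the fast-increase multiplier d(i).
    """
    if state in ("increase", "check_for_bw"):
        return dist_weight(r, r_min, r_max) * incr
    if state == "fast_increase":
        return d * dist_weight(r, r_min, r_max) * incr
    if state in ("low_increase", "low_check_for_bw"):
        return incr_for_low
    raise ValueError(f"state {state!r} is not an increase state")
```

A flow at the midpoint of its rate range gets half of the base step in state increase and d times that amount in state fast_increase.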



Table 3. Increase step function α(i).



Visited State                    Increase step α(i)
-------------------------------  -------------------------------------------
increase, check_for_bw           α(i) = dist_weight(r(i)) * Incr(i)
fast_increase                    α(i) = d(i) * dist_weight(r(i)) * Incr(i)
low_increase, low_check_for_bw   α(i) = incr_for_low(i)



The decrease factor $\beta(i)$ has two attributes. The first attribute is the value $(1 - p^k(i))$,
which is common to the decrease factors of the visited states
decrease, fast_decrease, and low_decrease. The rate $r(i) \cdot (1 - p^k(i))$ is
roughly the rate under which no packet loss would occur, provided that packet losses
occur at a constant rate $p^k(i)$ over the interval between two consecutive adaptation
points. In part because this is not likely in most cases, packet losses will continue to
occur even if the new rate is set to $(1 - p^k(i)) \cdot r(i)$. Therefore, a second attribute,
denoted by decr_parameter(i), is introduced to intensify the rate decrease. The
value of decr_parameter(i) depends on the state visited, as shown below.

(i). If the visited state is decrease, decr_parameter(i) is less than 1; a
typical value used is 0.98, but other values may be more appropriate depending on the
environment.

(ii). If the visited state is low_decrease, decr_parameter(i) is set to 1.
The reason is that, since the rate in states low_decrease and low_increase is
near $r_{min}(i)$, a further rate decrease is likely to reduce the rate to $r_{min}(i)$, which is
not desirable.

(iii). If the visited state is fast_decrease, then it is likely that a new flow has been
initiated, as indicated earlier. In this case, the parameter $\mathit{threshold}_{decr}(i)$, with a
value less than one and a typical value of 0.9, is used:

(a) if $p^k(i) < 1 - \mathit{threshold}_{decr}(i)$, then decr_parameter(i) is set to
$\mathit{threshold}_{decr}(i)$. Multiplicative reduction by $\mathit{threshold}_{decr}(i) \cdot (1 - p^k(i))$ is a
considerable reduction. The reason for this drastic decrease is that, first, it reduces the
probability of packet losses during the next interval and, second, it enables a new flow
(which is likely to exist) to compete for more bandwidth. Otherwise, the new flow’s
rate would be kept around its initial rate.

(b) if $p^k(i) \ge 1 - \mathit{threshold}_{decr}(i)$, the factor $\beta(i)$ is set to $(1 - p^k(i))$.

When the visited state is decr_due_nofeedback, the factor used is
$\mathit{param}_{nofeed}(i)$, with a typical value of 0.75. Table 4 summarizes the different values
of the multiplicative decrease factor.



Table 4. Multiplicative factor β(i).



Visited State       Multiplicative factor β(i)
------------------  -------------------------------------------------------------------
decrease            β(i) = decr_param(i) * (1 - p^k(i)); decr_param(i) = 0.98
low_decrease        β(i) = (1 - p^k(i))
fast_decrease       if (1 - p^k(i)) > threshold_decr(i) then β(i) = threshold_decr(i)
                    else β(i) = (1 - p^k(i))
decr_due_nofeed.    β(i) = param_nofeed(i)
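
The multiplicative factor can be sketched as a single function. This example follows the entries of Table 4 as printed; in particular, for fast_decrease it returns threshold_decr(i) itself when the loss is small, which is how the table states it, whereas the running text describes a reduction by threshold_decr(i)*(1-p). The function name and signature are assumptions, and the default constants are the typical values quoted in the paper.

```python
def decrease_factor(state, p=None, decr_param=0.98, threshold_decr=0.9,
                    param_nofeed=0.75):
    """Multiplicative decrease factor for the visited state.

    p is the reported packet loss rate as a fraction in [0, 1); it is not
    needed (and may be None) for state decr_due_nofeedback.
    """
    if state == "decr_due_nofeedback":
        return param_nofeed
    if p is None or not 0.0 <= p < 1.0:
        raise ValueError("a loss rate in [0, 1) is required")
    if state == "decrease":
        return decr_param * (1.0 - p)
    if state == "low_decrease":
        return 1.0 - p
    if state == "fast_decrease":
        # Table 4: cap the factor at threshold_decr when the loss is small.
        return threshold_decr if (1.0 - p) > threshold_decr else (1.0 - p)
    raise ValueError(f"state {state!r} is not a decrease state")
```

The new rate is then the current rate multiplied by the returned factor, clipped from below at the minimum rate as described in section 2.2.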



4. Rate Locking/Unlocking Procedure 

The D.AI-L.MD scheme ensures that each flow converges to the fairness point, since 
it is an AI-MD scheme [23]. When a fair rate or the access bandwidth capacity (ceil- 
ing) is reached, an effective rate adaptation scheme should try to (temporarily) lock 
the flow’s rate. This locking should not be permanent but should be reconsidered 
periodically by employing a proper bandwidth probing mechanism. 

To lock the rate, a good mechanism for estimating the current fair rate (or the
effective access bandwidth ceiling) should be employed. Since the flow’s rate follows a
pattern of increases and decreases, a good estimate of the locking rate should be
based on recent rates (as opposed to the current rate only), which should be properly
weighted. Let local_min(i) denote the rate just before the most recent transition from
state decrease to state increase and let

$$\mathit{sm\_local\_min}^{k+1}(i) = f_2 \cdot \mathit{sm\_local\_min}^k(i) + (1 - f_2) \cdot \mathit{local\_min}(i), \quad f_2 = 0.9$$

denote the smoothed running average of the local_min(i) rates. If the current rate is
above the fairness point it will be decreasing (non-monotonically in general) to reach
the fairness point, as the AI-MD characteristic of the algorithm guarantees; during
this period both local_min(i) and sm_local_min(i) will be monotonically
non-increasing. Similarly, if the current rate is below the fairness point, it will be
increasing (non-monotonically in general) and during this period local_min(i) and
sm_local_min(i) will be monotonically non-decreasing. When the rate reaches the fair
rate level (fairness point), both the current rate and local_min(i) will oscillate around
it. After some time period, which depends on the value of $f_2$, sm_local_min(i) will
reach the current local_min(i), and this marks the time instant when the rate is locked
(the lock decision function returns 1). The greater the value of $f_2$, the more accurate
the locking rate is expected to be (that is, closer to the fairness point) and the longer
it takes to lock. The locking rate is not the one in effect at the locking time instant but
rather a smoothed rate based on recent actual rates (and not the local_min(i) rates),
given by

$$\mathit{aver\_rate\_recent}^{k+1}(i) = f_3 \cdot \mathit{aver\_rate\_recent}^k(i) + (1 - f_3) \cdot r(i) \qquad (1)$$

with a selected value of $f_3 = 0.67$. It should be noted that locking the rate near the
fairness point is possible for flows with unlimited access bandwidth or flows whose
access bandwidth is higher than the fairness point. Flows whose access bandwidth is
below the fairness point are locked near their ceiling rate.
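
The two running averages and the lock test can be sketched as below. Several details are assumptions made only for illustration, because the text does not specify them: the class and method names, seeding each average with its first sample, and the relative tolerance `tol` used to decide that sm_local_min has "reached" local_min. The weights 0.9 and 0.67 are the values given in the text.

```python
class LockEstimator:
    """Tracks sm_local_min and aver_rate_recent for one flow."""

    def __init__(self, f2=0.9, f3=0.67, tol=0.02):
        self.f2 = f2    # weight of the smoothed local minimum
        self.f3 = f3    # weight of the smoothed recent rate, Eq. (1)
        self.tol = tol  # hypothetical relative tolerance for the lock test
        self.sm_local_min = None
        self.aver_rate_recent = None

    def on_rate(self, r):
        """Update the smoothed recent rate at an adaptation point."""
        if self.aver_rate_recent is None:
            self.aver_rate_recent = r  # assumed initialisation
        else:
            self.aver_rate_recent = (self.f3 * self.aver_rate_recent
                                     + (1 - self.f3) * r)
        return self.aver_rate_recent

    def on_local_min(self, local_min):
        """Update sm_local_min at a decrease -> increase transition.

        Returns True when the smoothed value is within tol of local_min,
        which is taken here as the locking instant.
        """
        if self.sm_local_min is None:
            self.sm_local_min = local_min  # assumed initialisation
            return False                   # no history yet, do not lock
        self.sm_local_min = (self.f2 * self.sm_local_min
                             + (1 - self.f2) * local_min)
        return abs(self.sm_local_min - local_min) <= self.tol * local_min
```

When the lock test succeeds, the rate to lock to is `aver_rate_recent`, not the rate in effect at that instant.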

As mentioned earlier, it may take a relatively long time to identify the locking time
instant if the procedure mentioned above is followed. To reach a rate locking decision
faster for flows whose ceiling is close to $r_{min}(i)$ (such as for dial-up users) and
avoid harmful rate oscillations and packet losses, a different rate locking strategy is
followed for such flows. The rate is locked (the lock decision function returns 1) to the
aver_rate_recent(i) rate (as in (1)) as soon as more than a small number (equal to
go_for_low_locking(i), typically three) of transitions occur from state
low_increase to low_decrease.

While in state lock the algorithm moves to state fast_decrease and, thus,
unlocks under a high nonzero-PLR feedback ($\mathit{diff}^k(i) > P_{new\_flow}(i)$). After spending
some time in state lock, state check_for_bw is visited, after zero-PLR feedback,
from which the algorithm may unlock the rate or eventually return to state lock.

Since the procedure for visiting state low_lock is not as reliable as the one
mentioned earlier and the locking rate may be incorrect, a procedure to unlock the rate
was introduced in section 2.2. Recall that from state low_lock the
state low_decrease is visited (unlocking) either if packet losses occur during the
first cancel_low_locking(i) intervals (typically two) after the locking decision
or if high nonzero-PLR feedback is received.



5. Simulation Results



The network simulator (ns) [26], with some developed RTP/RTCP support
enhancements, and the topology shown in Figure 2 are employed in the simulations.
Nine sources and receivers are defined. Each source $s_i$ transmits an A-CBR flow i
over RTP/RTCP to its associated receiver $r_i$. Receivers $r_1$ and $r_2$ are connected at 64
kbps, simulating dial-up users connected at this rate; receivers $r_3$ and $r_4$ are connected
at 0.5 Mbps, simulating ADSL users connected at this rate; receivers $r_5$ to $r_9$ are
connected at 1.5 Mbps, simulating LAN or ADSL users connected at this rate. Flows 1-4
are access network limited, whereas flows 5-9 are only backbone network limited.
The backbone links are full duplex, and their bandwidth capacity is 4 Mbps. The
available memory size is 25 packets for the interfaces that connect the
routers and 2 packets for the interfaces that connect sources and receivers to
routers (1 and 4). The routers are drop-tail routers. The RTCP Sender and Receiver
Reports (SR, RR) are sent every five (5) seconds. No TCP flows are assumed to be
present, in order to demonstrate the behavior of the proposed multi-state rate
adaptation algorithm in the environment it is primarily designed for.




Fig. 2. Network topology and access configuration. 

In all simulation configurations, the flows are initiated and terminated according to
Table 5. Flow 6 is terminated at the 1200th second, whereas flow 9 is initiated at the
1800th second, to demonstrate the response of the simulated schemes to flow
initiation/termination. The RTP packet size is common for all configurations and flows and
set equal to 1000 bytes. The simulation duration is 2000 seconds. The following
parameters are used: $r_{min}(i)$ = 56 kbps, $r_{max}(i)$ = 1.2 Mbps, Incr(i) = 30 kbps for
all flows i, assuming that the sources are not aware of any access capacity limitation.



Table 5. Start/stop times of flows and rates. 



flow                  1     2     3     4     5     6     7     8     9
--------------------  ----  ----  ----  ----  ----  ----  ----  ----  ----
start time in sec.    0     1.33  2.17  3.22  4.5   5.17  6.3   7.1   1800
end time in sec.      2000  2000  2000  2000  2000  1200  2000  2000  2000
initial rate (kbps)   56    128   440   550   400   600   1180  1200  600

Two different schemes are simulated and compared in order to demonstrate the
performance of the proposed D.AI-L.MD scheme. The first scheme is the classic AI-MD
scheme with a constant increase step of 30 kbps and a decrease factor of 0.97. This
factor is close to the corresponding factor of the D.AI-L.MD scheme for an average
packet loss rate of 1% (0.98*(1-0.01)). A decrease factor less than this value, e.g., 0.85
or 0.5, leads to a more aggressive decrease, which may be appropriate for data but not
for CM flows. Figure 3a illustrates the adaptation behavior of the AI-MD scheme.




(Two panels, a and b; horizontal axis: time in seconds.)

Fig. 3. Flows adaptation in AI-MD and D.AI-L.MD schemes.

The second scheme is the proposed multi-state rate adaptation algorithm
(D.AI-L.MD). The following parameters of the D.AI-L.MD scheme are used: d_init(i) = 5,
param_nofeed(i) = 0.75, decr_param(i) = 0.98, threshold_decr(i) = 0.9,
max_cons_times_in_decrnofeed(i) = 3, incr_for_low(i) = 1 kbps,
P_new_flow(i) = 0.06 for flows 1-2, P_new_flow(i) = 0.04 for flows 3-9,
times_in_rmin(i) = 3, go_for_check_bw(i) = 8, go_to_unlock(i) = 4,
go_for_low_check_bw(i) = 10, cancel_low_locking(i) = 2,
go_to_low_unlock(i) = 6, escape_low_states(i) = 20. For the function
go_for_fast_increase(i) = fi_min(i) + (1 - dist_weight(r(i))) * (fi_max(i) - fi_min(i)),
the values of 2 and 12 are used for parameters fi_min(i) and fi_max(i), respectively.
Figure 3b illustrates the adaptation behavior of the D.AI-L.MD scheme.
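
The rate-dependent threshold go_for_fast_increase(i) can be evaluated as in the following sketch. The function signature is an assumption; the defaults are the simulation values quoted above (rates in kbps). Whether and how the result is rounded to an integer count is not stated in the text, so the sketch returns the unrounded value.

```python
def go_for_fast_increase(r, r_min=56.0, r_max=1200.0, fi_min=2, fi_max=12):
    """Number of consecutive visits to state increase before fast_increase.

    Grows linearly from fi_min at r = r_min to fi_max at r = r_max, so that
    lower-rate flows switch to fast_increase sooner.
    """
    dist_weight = (r_max - r) / (r_max - r_min)
    return fi_min + (1 - dist_weight) * (fi_max - fi_min)
```

With the values above, a flow at the midpoint of the rate range needs seven consecutive zero-PLR visits to state increase before it enters fast_increase.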

In view of the perceptual QoS requirement of CM flows, it is highly desirable that
the rate adaptations be smooth and as infrequent as possible. Oscillatory rate
adaptation behavior leads to highly fluctuating perceived quality that may not be tolerated
by the end users. The source also benefits from fewer and smoother adaptations,
since they are less resource demanding. The AI-MD scheme presents a larger
deviation of the current rates from their running long-term
rate average (Figure 3a) than the proposed D.AI-L.MD scheme (Figure 3b).
The latter scheme presents smoothed long-term oscillatory behavior. The AI-MD
scheme also induces a greater number of adaptations than the D.AI-L.MD
scheme (Table 6).



Table 6. Occurred adaptations in AI-MD and D.AI-L.MD schemes. 



Flows       1    2    3    4    5    6    7    8    9    Total
----------  ---  ---  ---  ---  ---  ---  ---  ---  ---  -----
AI-MD       397  396  397  396  396  236  396  396  39   3049
D.AI-L.MD   218  197  180  229  371  202  361  356  30   2144



Fig. 4. Fairness index F in AI-MD and D.AI-L.MD schemes.

Fairness and convergence to fairness are requirements set when multiple flows
(users) share multiple resources [23]. The fairness index F, presented in [23], is used to
measure the fairness among the A-CBR CM flows. This metric is not meaningful for
the limited-access bandwidth flows, i.e., flows 1-4, and thus the fairness index
calculation is based on flows 5-9:

$$F^k = \frac{\left(r^k(5) + r^k(6) + r^k(7) + r^k(8) + r^k(9)\right)^2}{5 \left(r^k(5)^2 + r^k(6)^2 + r^k(7)^2 + r^k(8)^2 + r^k(9)^2\right)}$$
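
The index can be computed directly from the current rates, as in the sketch below. The function name is an assumption, and the function takes whatever list of rates it is given; in the paper the list would hold the rates of flows 5-9.

```python
def fairness_index(rates):
    """Fairness index: (sum of rates)^2 / (n * sum of squared rates).

    Equals 1 when all rates are equal and 1/n when a single flow
    holds all the bandwidth.
    """
    n = len(rates)
    sum_sq = sum(r * r for r in rates)
    if n == 0 or sum_sq == 0:
        raise ValueError("at least one nonzero rate is required")
    return sum(rates) ** 2 / (n * sum_sq)
```

For five flows the index therefore ranges between 0.2 and 1.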

In Figure 4, the line y=1 indicates the optimal fairness. The time required to reach
this line and the achieved level are measures of the convergence to fairness and of
fairness, respectively. As illustrated in Figure 4, the D.AI-L.MD scheme requires less
time to converge to the fairness point and presents a smaller deviation from it, which
enables a more accurate locking.



Table 7. Mean long-term packet loss rate (%).

Flows   AI-MD    D.AI-L.MD
1       16.522   1.657
2       16.857   1.935
3        1.156   0.483
4        1.215   0.562
5        0.711   0.275
6        0.730   0.282
7        0.726   0.272
8        0.757   0.301
9        2.00    0.485



Table 8. Mean conditional packet loss rate (%).

Flows   AI-MD    D.AI-L.MD
1       16.609   4.570
2       16.389   3.877
3        1.674   0.933
4        1.587   0.922
5        1.288   0.894
6        1.266   0.687
7        1.215   0.690
8        1.256   0.758
9        2.005   0.936



Table 9. Total received Kbytes.

Flows   AI-MD     D.AI-L.MD
1        15,700    15,250
2        15,698    15,197
3       115,710   119,679
4       114,442   118,792
5       189,403   192,803
6        97,801   102,036
7       205,214   200,336
8       213,642   204,078
9        14,405    15,767
Total   982,015   983,938



Table 7 shows the mean long-term packet loss rate, that is, the total number of lost
packets over the total number of transmitted packets during the simulation time.
Table 8 shows the mean conditional packet loss rate, that is, the reported packet loss
rate averaged over the intervals during which nonzero losses occurred; it indicates
the average level of losses when they occur. The D.AI-L.MD scheme presents
significantly lower loss rates than the AI-MD scheme, especially for flows 1 and 2,
which transmit at a low rate due to the access network limitations.
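The difference between the two loss metrics can be made concrete with a short Python sketch. It assumes, purely for illustration, that losses are reported per interval as (lost, sent) packet counts; the function names and the sample counts are hypothetical and are not taken from the simulation.

```python
def long_term_loss_rate(lost, sent):
    """Total lost packets over total transmitted packets, in percent."""
    total_sent = sum(sent)
    return 100.0 * sum(lost) / total_sent if total_sent else 0.0


def conditional_loss_rate(lost, sent):
    """Mean of the per-interval loss rates, over intervals with nonzero loss only."""
    rates = [100.0 * l / s for l, s in zip(lost, sent) if s > 0 and l > 0]
    return sum(rates) / len(rates) if rates else 0.0


# Hypothetical per-interval counts: losses occur in two of four intervals.
lost_per_interval = [0, 5, 0, 10]
sent_per_interval = [100, 100, 100, 100]
long_term = long_term_loss_rate(lost_per_interval, sent_per_interval)
conditional = conditional_loss_rate(lost_per_interval, sent_per_interval)
print(long_term, conditional)
```

Since loss-free intervals are excluded from the conditional rate, it can never be lower than the long-term rate when intervals carry equal traffic, which matches the relation between the values of Table 7 and Table 8 for most flows.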

Statistics concerning the bytes sent and received are gathered. Table 9 presents the
total received Kbytes during the simulation period. Figure 5 illustrates the
throughput, that is, the received RTP packets, over the time window of 1000-1500
seconds. The throughput during this time interval is typical of the overall simulation
time. As observed from Table 9 and Figure 5, the D.AI-L.MD scheme is slightly better
than the AI-MD scheme in terms of throughput.

As far as responsiveness, i.e., the time to recover the bandwidth released by a
terminated flow, is concerned, the AI-MD scheme reaches the bandwidth capacity
faster than the D.AI-L.MD scheme, due to its constant increase step (Figure 6). The
D.AI-L.MD scheme seizes the bandwidth at a slower pace than the AI-MD scheme
because of its distance-weighted additive increase feature. The responsiveness of the
D.AI-L.MD scheme depends on the distance of the flow's rate from the [...]. The
larger the distance (relatively low rate flows), the faster the response. The smaller
the distance (relatively high rate flows, as in the simulation case), the slower the
response. Despite the slower response to the bandwidth release for relatively high
rate flows, the D.AI-L.MD scheme is better than the AI-MD scheme in terms of
fairness, convergence to fairness, packet losses, throughput and oscillatory behavior.




Fig. 5. Throughput of the schemes. 



Fig. 6. Response to bandwidth availability. 



6. Conclusions 

In this paper, a multi-state congestion control algorithm for rate adaptable unicast CM
flows in a TCP-free environment has been presented. This scheme (a) provides a finer
rate adaptation decision space compared to traditional schemes, (b) enables rate
locking, and (c) responds well to a diversity of network situations, such as limited
access bandwidth and initiation/termination of CM flows. The proposed D.AI-L.MD
scheme presents significantly lower packet loss rates and less oscillatory behavior
than the AI-MD scheme. It also converges faster and closer to the fairness point.
The AI-MD scheme, in turn, is more responsive to bandwidth availability than the
D.AI-L.MD scheme. Work is in progress to further investigate the performance of the
proposed scheme and to extend it to environments in which both TCP/IP and A-CBR
CM flows co-exist in the access and/or core network.



References 



[1] Floyd, S., “TCP and Explicit Congestion Notification”. ACM Computer Communication 
Review, V. 24 N. 5, October 1994, p. 10-23. 

[2] RFC 2481. Ramakrishnan, K.K., and Floyd, S., “A Proposal to add Explicit Congestion 
Notification (ECN) to IP”. January 1999. 








[3] V. Jacobson, “Congestion avoidance and control”. In Proceedings of the SIGCOMM '88
Conference on Communications Architectures and Protocols (1988).

[4] RFC-2001. “TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery 
Algorithms”. 

[5] RTP Frequently Asked Questions: http://www.cs.columbia.edu/~hgs/rtp/faq.html .

[6] RFC 1889. RTP: A Transport Protocol for Real-Time Applications.

[7] Floyd, S., and Fall, K., “Promoting the Use of End-to-End Congestion Control in the Inter- 
net”. IEEE/ ACM Transactions on Networking. August 1999. 

[8] I. Busse, B. Deffner, H. Schulzrinne, “Dynamic QoS Control of Multimedia Applications
based on RTP”. May 1995.

[9] J. Bolot, T. Turletti, “Experience with Rate Control Mechanisms for Packet Video in the
Internet”. ACM SIGCOMM Computer Communication Review, Vol. 28, No. 1, pp. 4-15,
Jan. 1998.

[10] D. Sisalem, F. Emanuel, H. Schulzrinne, “The Loss-Delay Based Adjustment Algorithm: 
A TCP-Friendly Adaptation Scheme”. 1998. 

[11] D. Sisalem and A. Wolisz, “LDA+: Comparison with RAP, TFRCP”. IEEE International
Conference on Multimedia (ICME 2000), July 30 - August 2, 2000, New York.

[12] D. Sisalem and A. Wolisz, “MLDA: A TCP-friendly congestion control framework for 
heterogenous multicast environments”. Eighth International Workshop on Quality of Ser- 
vice (IWQoS 2000), 5-7 June 2000, Pittsburgh. 

[13] D. Sisalem and A. Wolisz, “Constrained TCP-friendly congestion control for multimedia
communication”. Tech. rep., GMD Fokus, Berlin, Germany, Feb. 2000.

[14] D. Sisalem, A. Wolisz, “Towards TCP-Friendly Adaptive Multimedia Applications Based 
on RTP”. Fourth IEEE Symposium on Computers and Communications (ISCC'1999), (Red 
Sea, Egypt), July 1999. 

[15] R. Rejaie, M. Handley, D. Estrin, “An End-to-end Rate-based Congestion Control
Mechanism for Realtime Streams in the Internet”. Proc. INFOCOM '99, 1999.

[16] J. Padhye, J. Kurose, D. Towsley, R. Koodli, “A Model Based TCP-Friendly Rate Control 
Protocol”. Proc. IEEE NOSSDAV'99 (Basking Ridge, NJ, June 1999). 

[17] ISO/IEC 14496. MPEG-4. 

[18] J. Pasquale, G. Polyzos, E. Anderson, and V. Kompella, “Filter propagation in dissemina- 
tion trees: trading off bandwidth and processing in continuous media networks”. Proc. of 
NOSSDAV'93. 

[19] E. Amir, S. McCanne, H. Zhang, “An application-level video gateway”. Proc. ACM
Multimedia ’95, San Francisco, Nov. 1995.

[20] IST VideoGateway project: http:/www.optibase.com.

[21] ISO/IEC 13818. “MPEG-2 - Generic coding of moving pictures and associated audio 
information”. 

[22] L. Vicisano, J. Crowcroft and L. Rizzo, “TCP-like Congestion Control for Layered Multi- 
cast Data Transfer”. Proc. InfoCom'98, San Francisco, March/ April 1998. 

[23] R. Jain, K. Ramakrishnan, and D.-M. Chiu. “Congestion avoidance in computer networks 
with a connectionless network layer.” Tech. Rep. DEC-TR-506, Digital Equipment Corpo- 
ration, Aug. 1987. 

[24] J. Padhye, V. Firoiu, D. Towsley, J. Kurose, “Modelling TCP Throughput: a Simple
Model and its Empirical Validation”. Proceedings of SIGCOMM '98.

[25] T. Kim, S. Lu and V. Bharghavan, “Improving Congestion Control Performance Through 
Loss Differentiation”. International Conference on Computers and Communications Net- 
works '99, Boston, MA. October 1999. 

[26] The network simulator (ns): http://www.isi.edu/nsnam/ns/ns-build.html 



PCP-DV: An End-to-End Admission Control 
Mechanism for IP Telephony 



G. Bianchi¹, F. Borgonovo², A. Capone², L. Fratta², and C. Petrioli³



¹ Università di Palermo
² Politecnico di Milano
³ Università di Roma “La Sapienza”



Abstract. In this paper we describe a novel endpoint admission control
mechanism for IP telephony, PCP-DV, which is characterized by two
fundamental features. First, it does not rely on any additional procedure
in internal network routers other than the capability to apply different
service priorities to probing and data packets. Second, the triggering
mechanism for the connection admission decision is based on the analysis
of the delay variation statistics over the probing flow. Numerical results
for an IP telephony traffic scenario prove that 99th delay percentiles not
greater than a few ms per router are guaranteed even in overload conditions.



1 Introduction 

It is widely accepted that today's best-effort Internet is not able to
satisfactorily support emerging services and market demands, such as IP Telephony.
Real-time services in general, and IP telephony in particular, impose very
stringent delay and loss requirements (less than 150 ms mouth-to-ear delay for toll
quality voice), which have to be maintained for the whole call holding time. The
analysis of the delay components in the path from source to destination shows
that up to 100-150 ms can be spent on compression, packetization, jitter
compensation, propagation delay, etc. [1], leaving no more than a few tens of ms
for queueing delay within the many routers on the path.

Many different proposals aimed at achieving such a tight QoS control on
the Internet have been discussed in the IETF. IntServ/RSVP (Resource reSerVation
Protocol) [2,3] provide end-to-end per-flow QoS by means of hop-by-hop resource
reservation within the IP network. Such an approach imposes a significant burden on
the core routers, which are required to handle per-flow signaling and forwarding
information on the control path.

A completely different approach is provided by Differentiated Services
(DiffServ) [4,5,6]. In DiffServ, core routers are stateless and unaware of any
signalling. They merely implement a suite of buffering and scheduling techniques and
apply them to a limited number of traffic classes, whose packets are identified on
the basis of the DS field in the IP packet header. The result is that a variety
of services can be constructed by a combination of: (i) setting packets' DS bits
at network boundaries, (ii) using those bits to determine how packets are
forwarded by the core routers, and (iii) conditioning the marked packets at network
boundaries in accordance with the requirements or rules of each service.

While DiffServ easily enables resource provisioning performed on a management
plane for permanent connections, its widely recognized limit is the lack
of support for per-flow resource management and admission control, resulting
in the lack of strict per-flow QoS guarantees. Recently, a number of proposals
have shown that per-flow Distributed Admission Control schemes can be
deployed over a DiffServ architecture [7,8,9,10,11,12,13,14,15,16,17]. Although
significantly differing in implementation details, these proposals, referred to
hereafter with the descriptive name Endpoint Admission Control (EAC) (according
to the overview paper [12]), share the common idea that accept/reject decisions
are taken by the network end points and are based on the processing of “probing”
packets, injected in the network at set-up to verify the network congestion
status.

A detailed analysis of the above EAC solutions shows that most of them rely
on some advanced form of cooperation from internal network routers, e.g.
probing packet marking [?] and ad hoc probing packet management techniques
[?]. More radical EAC solutions have been proposed in [?]. These schemes
require the routers to be only capable of distinguishing between probing and
data packets (e.g. via TOS precedence bits, or the DSCP field of DiffServ), and to
configure elementary buffering and scheduling schemes, already available in
current routers. In particular, the capability to apply to probing and data packets
a priority-based forwarding scheme is the only router requirement in [?], while
in [?] a strict limit on the probing buffer size is also enforced. This simplicity
makes these radical schemes, hereafter referred to as “pure” EAC, suited to be
introduced in the Internet in a very short time frame.

In this paper, we propose an improved version of the scheme presented in
[?], called PCP-DV (Phantom Circuit Protocol - Delay Variation), where the
acceptance test is based on delay variation analysis. We have thoroughly evaluated
the performance of PCP-DV for a wide range of parameter settings, and proved
that PCP-DV is indeed capable of providing 99th delay percentiles not greater
than a few ms per router even in heavy overload conditions.

The paper is organized as follows. In Section 2, the PCP-DV operation is
described, and the rationale at the basis of the PCP-DV decision criterion is
provided. Section 3 describes the simulation model and presents the VoIP
variable bit rate traffic scenario adopted. Section 4 is dedicated to the performance
evaluation and parameter tuning of PCP-DV. Conclusions are drawn in Section 5.



2 PCP-DV Basic Operation 

PCP-DV is an acronym for “Phantom Circuit Protocol with acceptance test
based on Delay Variation analysis”. The PCP-DV connection setup scheme is
shown in Figure 1. A user who wants to set up a connection starts a preliminary
Probing Phase. The purpose of this phase is to verify, by means of probing packets
injected into the network by the source node, whether there are enough resources
in the network to accept a new connection. The decision whether to accept or
reject the connection is taken at the destination node, based on measurements of
the probing flow arrival statistics.

Fig. 1. PCP-DV probing and data phases

The probing phase lasts until an explicit accept/reject feedback packet is re- 
ceived from the destination node, or a suitable timeout expires. In case of accep- 
tance, the sender enters the Data phase in which data packets (i.e. information 
packets) are transmitted. 

The PCP-DV probing phase, graphically shown in Figure 1, consists of the
transmission of a fixed number Np of packets with a fixed interdeparture time
I. Once all the Np packets are transmitted, the source waits until a feedback
arrives, or the probing phase timeout expires.

The only requirement PCP-DV imposes on the core network is the capability
to distinguish probing packets from data packets. Probing and data packets
are tagged with different labels in the IP header (TOS or DSCP field). Core
routers have the sole task of forwarding packets according to a head-of-the-line
priority scheme: high priority to data packets, low priority to probing packets.
This forwarding mechanism serves a probing packet only when the data packet
queue is empty. Therefore, it guarantees that the probing traffic load, which is
not admission-controlled, does not contend for bandwidth with established
traffic. Furthermore, since probing packets use only resources not used by
accepted calls, the probing flow received at the destination carries indirect
information on the congestion status of the links, which can be used to perform
the accept/reject test.

In PCP-DV, the decision on whether to admit or reject a new connection is
taken at the destination and notified back to the traffic source by means of one
(or more) feedback packets. The PCP-DV decision criterion operates as follows.
Upon reception of the first probing packet, the destination node starts a timer
and measures the interarrival time T of the next probing packet. If the condition

    I - Dt < T < I + Dt    (1)

where Dt is a parameter called the “acceptance delay threshold”, representing the
maximum tolerance on the jitter of the received probing packets, is met, the timer
is restarted and the above procedure is iterated until all Np packets are received.
Conversely, if the condition fails for one received probing packet, or the timer
exceeds the upper limit I + Dt without any probing packet being received, the
connection is rejected and a feedback rejection packet is immediately sent back to
the source. The parameters Dt and Np allow tuning the effectiveness of PCP-DV
in controlling the accepted traffic load.
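The decision criterion described above can be sketched in Python as follows. This is an illustrative, offline rendering under simplifying assumptions, not the authors' implementation: the destination is given the complete list of probe arrival times (in ms) instead of running a live timer, a probe that never arrives is modeled as a missing entry, and the function and parameter names are hypothetical.

```python
def pcp_dv_accept(arrival_times_ms, interdeparture_ms, delay_threshold_ms, n_probes):
    """Return True if the probing flow passes the delay variation test.

    arrival_times_ms: arrival times of the received probes, in increasing order.
    The connection is accepted only if all n_probes probes arrive and every
    interarrival time T satisfies I - Dt < T < I + Dt.
    """
    if len(arrival_times_ms) < n_probes:
        # Some probe never arrived: the timer would exceed I + Dt, so reject.
        return False
    lower = interdeparture_ms - delay_threshold_ms
    upper = interdeparture_ms + delay_threshold_ms
    received = arrival_times_ms[:n_probes]
    # n_probes probes yield n_probes - 1 interarrival measurements.
    for previous, current in zip(received, received[1:]):
        if not (lower < current - previous < upper):
            return False
    return True


# Hypothetical probe traces with I = 26 ms, Dt = 3 ms and Np = 4.
smooth = pcp_dv_accept([0, 26, 53, 78], 26, 3, 4)      # gaps 26, 27, 25 ms
jittery = pcp_dv_accept([0, 26, 60, 86], 26, 3, 4)     # one gap of 34 ms
incomplete = pcp_dv_accept([0, 26], 26, 3, 4)          # probes missing
print(smooth, jittery, incomplete)
```

A single out-of-range interarrival time is enough to reject the call, which is why the number of probes Np directly affects how selective the test is.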

The rationale of the proposed scheme comes from the observation that toll
quality delay performance requires a strict control of the load accepted on a link.

Fig. 2. Throughput/delay tradeoffs

Figure 2 shows the 99-th delay percentile of the accepted traffic versus the
accepted traffic load. These results have been obtained by simulating various
EAC schemes (PCP-DV plus the approaches presented in [14,13]) with several
parameter settings on 2 and 5 Mbit/s channels. We have observed that the
delay behavior is independent of the acceptance scheme adopted. The sharp
knee of the curves allows a threshold on the accepted load to be determined,
e.g. 75% to 79% for a 2 Mbit/s link and 86% to 88% for a 5 Mbit/s link,
corresponding to a 99-th delay percentile of a few ms (3 to 5 ms). The delay QoS
problem is thus reduced to load control on the links.




Fig. 3. Probability that the probing packets jitter exceeds a given delay threshold (3,
5, 7, and 10 ms), versus link load - 2 Mbit/s link capacity



By simulation we have measured the delay variation of the probing packets;
Figure 3 shows the probability that the delay jitter |I − T| exceeds the delay
threshold, versus the accepted load. Four acceptance delay thresholds, equal to 3,
5, 7 and 10 ms, have been considered. From the figure, we see that the curves
become sharper as the threshold increases. However, none of the thresholds is able
to provide a sufficiently high rejection probability at the critical load of 0.75. To
reach the required efficiency of the test, we need to consider multiple samples of
the jitter. The power of the test performed in PCP-DV is measured by the rejection
probability over Np − 1 measures, shown in Figure 4. The goal of limiting the traffic
to 0.75 can be reached either with Dt = 3 ms and Np = 11 or with Dt = 10
ms and Np = 77. Even though both cases achieve this goal, the curve with the
smaller number of samples has a higher rejection probability at loads below the
threshold. Therefore, there is a tradeoff between the number of probing packets
(i.e. the probing phase duration) and the effectiveness of the test at low loads.
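The effect of taking multiple jitter samples can be illustrated with a short Python sketch. It rests on an assumption that is ours and is not made in the text: the Np − 1 jitter measures are treated as independent, each exceeding the threshold with the same probability p, so that the call is rejected with probability 1 − (1 − p)^(Np − 1). The function name and the value of p are hypothetical and are not read from the figures.

```python
def rejection_probability(p_exceed, n_probes):
    """Probability that at least one of n_probes - 1 jitter samples exceeds Dt.

    Assumes independent samples with a common exceedance probability p_exceed
    (an illustrative simplification).
    """
    if not 0.0 <= p_exceed <= 1.0:
        raise ValueError("p_exceed must lie in [0, 1]")
    if n_probes < 2:
        raise ValueError("at least two probes are needed for one measure")
    return 1.0 - (1.0 - p_exceed) ** (n_probes - 1)


# Hypothetical per-sample exceedance probability of 10%.
short_phase = rejection_probability(0.1, 11)   # 10 jitter samples
long_phase = rejection_probability(0.1, 77)    # 76 jitter samples
print(short_phase, long_phase)
```

Under this simplified model, even a modest per-sample exceedance probability turns into a high call rejection probability once enough samples are taken, which is consistent with the observation that a larger threshold has to be paired with a longer probing phase.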

For the correct operation of PCP-DV, the link load must actually reflect
the resources seized by the accepted calls. Therefore, in the case of VBR
traffic, conditioning mechanisms may be required to send dummy packets when
sources are inactive or under-utilize the assigned resources. Such shaping
procedures, common to resource reservation techniques based on traffic measurements
(see for example [?]), constitute the price to pay for the reduced complexity of
the call admission procedure.

Fig. 4. Call rejection probability versus load, for 3 ms and 10 ms delay thresholds, and
for different numbers of probing pairs

PCP-DV, unlike stateful centralized solutions, can support only a
single QoS requirement. In fact, as long as only two priority levels (probing/data)
are used within the network routers, heterogeneous real-time connections with
different loss/delay requirements are forced to share the same queue and thus,
regardless of how sophisticated the end-to-end measurement scheme is, they
ultimately encounter the same performance. To overcome this limitation, PCP-DV
can be used to perform call admission within a DiffServ class. Isolation between
DiffServ classes can then be achieved by adopting a WFQ-like mechanism that
assures a given rate to the PCP-admission-controlled class. However, for the protocol
to operate correctly, this mechanism must prevent admission-controlled traffic from
borrowing bandwidth from the other classes. For a thorough investigation of the
architectural issues related to the deployment of EAC schemes, including ways to
implement the class isolation described above and a discussion of mechanisms to
provide multiple levels of service, see [12].

An important feature of PCP is its intrinsic stability and robustness. In fact, 
when an increase in the accepted traffic above the QoS limits occurs (e.g. because 
of rerouting of accepted connections, or because of misacceptance due to the 
unreliability of the measurement process), thanks to the forwarding mechanism 
employed, the probing traffic is throttled. This will prevent acceptance of new 
connections and the congestion disappears as soon as some active calls end. 






3 Simulation Model 



To evaluate the throughput and delay performance of PCP-DV, we have used
a simulator written in C++. We have considered an IP telephony variable bit
rate traffic scenario. Voice sources with silence suppression have been modeled
according to the two-state (ON/OFF) Brady model. In particular, each
voice call alternates between ON and OFF states. During the ON state, the voice
source emits talkspurts at a fixed peak rate Bp = 32 kb/s, while in the
OFF state it is silent. Both ON and OFF periods are exponentially distributed,
with mean values equal to 1 s and 1.35 s, respectively. The activity factor Bu
is defined as the fraction of time a voice source is in the ON state.

We have considered homogeneous voice sources that generate fixed-sized 
packets 1000 bits long. This corresponds to an interdeparture packet time equal 
to 31.25 ms. Probing packets are generated at a constant rate with interdeparture 
time I = 26 ms. 

We have considered a dynamic link load scenario where calls are generated
according to a Poisson process and have an exponentially distributed duration.
For convenience, we define the normalized offered load, ρ, as:

$$\rho = \frac{\lambda}{\mu} \cdot \frac{B_p B_u}{C} \qquad (2)$$

where λ (calls/s) is the call generation rate, 1/μ (s) is the average call duration
and C (kbit/s) is the channel rate. For a given normalized offered load, the
probing traffic load depends on 1/μ: the greater the average call duration, the
lower the probing load.
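The traffic parameters of this section can be checked numerically with the following Python sketch. It only restates the definitions given above (activity factor as the mean ON time over the mean ON plus OFF time, packet interdeparture time as packet size over peak rate, and normalized offered load as in (2)); the function names and the example call rate are hypothetical.

```python
def activity_factor(mean_on_s, mean_off_s):
    """Fraction of time an ON/OFF source spends in the ON state."""
    return mean_on_s / (mean_on_s + mean_off_s)


def packet_interval_ms(packet_bits, peak_rate_kbps):
    """Interdeparture time of fixed-size packets; bits / (kbit/s) gives ms."""
    return packet_bits / peak_rate_kbps


def normalized_offered_load(call_rate, mean_duration_s, peak_rate_kbps,
                            activity, capacity_kbps):
    """Normalized offered load: (lambda / mu) * Bp * Bu / C."""
    return call_rate * mean_duration_s * peak_rate_kbps * activity / capacity_kbps


b_u = activity_factor(1.0, 1.35)              # mean ON = 1 s, mean OFF = 1.35 s
interval = packet_interval_ms(1000, 32.0)     # 1000-bit packets at 32 kb/s
# Hypothetical call rate of 0.5 calls/s, 3-minute calls, 2 Mbit/s link.
rho = normalized_offered_load(0.5, 180.0, 32.0, b_u, 2000.0)
print(b_u, interval, rho)
```

The computed packet interval reproduces the 31.25 ms interdeparture time quoted for the voice sources, and the mean ON and OFF periods give an activity factor of 1/2.35.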

Further assumptions in our simulation model are: zero link propagation
delay, no loss of feedback packets, and instantaneous feedback reception.



4 Performance Evaluation 

An extensive simulation campaign has been performed, using the program de- 
scribed in the previous section, to investigate the PCP-DV performance in several 
scenarios. In this section we summarize the results concentrating on the accepted 
load and the 99-th packet delay percentile achievable by PCP-DV under different 
system operation conditions. 

Figure 5 shows the normalized accepted load versus the offered load, for two
link capacity values (C = 2 and 5 Mb/s). All curves, which correspond to different
parameter settings, show the ability of PCP-DV to guarantee a 99-th delay
percentile of a few ms (the maximum values are indicated in the figure) and to
limit the traffic below the thresholds discussed for Figure 2, i.e. 0.75 and 0.85 for
a 2 and a 5 Mb/s channel, respectively. This ability has been proved even in very
high overload conditions, i.e. with an offered load up to 4 times the channel capacity.

However, different parameter settings show different behaviors for offered
traffic loads ranging between 0.5 and 1.5. In these practical operational
conditions, better performance is obtained, as anticipated in Figure 4, by adopting a
longer probing phase (Np = 77) and a larger Dt (7 ms). With these parameter
values, the acceptance test is less likely to reject calls in underload traffic conditions,
which results in a 20% improvement of the accepted load for an offered load equal
to 1. To optimize the performance, it is therefore suggested to choose a probing
phase as long as is compatible with other performance requirements, such
as the call setup time.

Fig. 5. PCP-DV: Accepted vs. Offered load for varying interval size and measurement
period length; 2 and 5 Mb/s link capacity. 99th delay percentiles are also reported for
selected samples.

The robustness of the PCP-DV parameter setting with respect to the probing
traffic load is proved by the results in Table 1. Increasing the call duration from
3 minutes to 10 minutes, the probing load reduces to 1/3, but the optimal delay
thresholds remain practically unchanged.



Table 1. Optimal delay threshold, Dt, for call duration of 3 and 10 min

Np   1/μ = 3 min   1/μ = 10 min
11   3.0 ms        2.7 ms
21   4.5 ms        4.2 ms
39   5.5 ms        5.3 ms
77   7.0 ms        6.8 ms



To extend the PCP-DV performance evaluation from the single-link case
considered so far to a multi-link network scenario, we have considered the network
in Figure 6, loaded by multi-hop calls, which cross all the routers, and single-hop
calls, each loading one link only.

Fig. 6. Multi-link scenario.

Fig. 7. Multi-link scenario: Accepted vs. Offered load for single hop and multi-hop
calls and 99th delay percentiles for selected samples; 2 Mb/s links capacity.

We have simulated a homogeneous scenario in which all links have the same
capacity, equal to 2 Mbit/s. Traffic is generated as in the single-link case, with
an average connection duration equal to 10 minutes. Figure 7 shows the accepted
versus the offered load and the 99th percentiles of the delay distribution for the
two different types of calls and several network sizes (i.e. numbers of routers).

The number I of crossed routers has a negligible effect on the total accepted
load but, as expected, as I increases we observe a higher percentage of admitted
short calls. In fact, longer calls are more likely to detect a “peak” load overflow
in one of the many crossed links. Even if this is an expected behavior of any
acceptance control scheme, one should note that PCP-DV tends to throttle the
multi-hop traffic even in low load conditions. The 99-th delay percentiles, also
reported in Figure 7, confirm that, though the multi-hop delay increases with I,
the target of a few tens of ms for a backbone can be met.



5 Conclusions 

In this paper we have described the “Phantom Circuit Protocol Delay
Variation” (PCP-DV), a fully distributed, end-to-end, measurement-based connection
admission control mechanism able to support per-flow QoS guarantees in IP
networks. This scheme determines whether a new connection request can be
accepted based on delay variation measurements taken on the probing packets at
the edge nodes. The only capability required of core routers is to implement a
forwarding procedure with two priority classes.

The performance evaluation presented in this paper shows that tight QoS
requirements can be supported by suitably engineering the protocol parameters.
We have considered the extremely challenging IP telephony scenario and
measured that QoS requirements as tight as a 99th delay percentile of just a few ms
can be guaranteed.

The PCP-DV approach conforms to a stateless Internet architecture, and it
is fully compatible with the Internet architecture promoted by the Differentiated
Services framework.



Abstract. The paper presents an analysis, in a DiffServ network scenario, of the
achievable Quality of Service performance when different aggregation
strategies between video and voice real-time services are considered. Each
network node of the analyzed DiffServ scenario is represented by a WF²Q+
scheduler. The parameter setting of the WF²Q+ scheduler is also discussed. In
particular, the necessary network resources, estimated from the WF²Q+
parameter setting obtained by considering the aggregation of traffic sources
belonging to the same service class, are compared with those estimated on a
per-flow basis. The higher gain achievable with the first approach with respect to
the second one is also qualitatively highlighted. The simulation results
show the problems that can arise when voice traffic is merged
with video service traffic. As a consequence, the results suggest placing
the two kinds of traffic in different service class queues.



1 Introduction 

A key challenge of the current telecommunication age is the development
of new architecture models for IP networks in order to satisfy the recent
QoS (Quality of Service) requirements of innovative IP-based services (e.g. IP
Telephony and videoconferencing).

At present, ISPs (Internet Service Providers) often provide the same service level
regardless of the traffic generated by their clients. Taking into account the
transformation of the Internet into a commercial infrastructure, it is possible to
understand the need to provide differentiated services to users with widely different
service requirements.

In this framework, the DiffServ approach is the most promising for implementing
scalable service differentiation in IP networks. Scalability is achieved by
considering aggregate traffic flows and conditioning the incoming traffic at the edge
of the network. Aggregation obviously decreases the complexity of traffic control in
the core network, but it produces some unwelcome effects, such as the “lock-out” and
“full-queues” phenomena, which contribute to increasing the end-to-end delay and
jitter of the traffic flow of a single service. These two phenomena take place,
respectively, when a few flows monopolize the queue space, preventing other
connections from getting into the queue, and when it is not possible to keep the
queues non-full. Hence, the effects of the aggregation mechanisms on the QoS
parameters of the different aggregated flows need to be further analyzed. The first
works in this field have highlighted relevant concepts to support traffic aggregation
[1]; however, which aggregation strategy is best to adopt is still an open issue. We
therefore investigate traffic aggregation strategies, since there is still no clear
position on which configuration is better (standardization bodies say nothing on
this matter). On the other hand, a recent publication [2] suggests dividing network
traffic into only two service classes (e.g. real-time and non-real-time), which seems
to us somewhat restrictive with respect to the different traffic features. In this
paper, we analyze the impact of aggregating real-time video and voice traffic in
the same service class (hence, in the same queue in a per-service queueing system) on
the QoS parameters experienced by the different flows. The QoS concept used in this
work is to be identified with the whole set of properties that characterize network
traffic (e.g. in terms of resource availability, end-to-end delay, delay jitter,
throughput and loss probability). The results obtained in this scenario are then
compared with those obtained considering real-time video and voice as separate
traffic flows.

The DiffServ architecture [3] is a good starting point, but it is useless without a
teletraffic engineering background able to provide the necessary differentiation.
Therefore, we should also take into account scheduling disciplines and their
dimensioning, in order to understand whether they may affect the results (changing
the scheduling discipline changes the way the flows are treated). Hence, we have
first chosen to use one of the best work-conserving scheduling algorithms (WF²Q+)
instead of the non-work-conserving one used in other analyses [4]. The choice derives
from the assumption that the better the scheduling algorithm is (while keeping its
complexity acceptable), the better the treatment a flow receives in terms of low
end-to-end delay and delay jitter. Moreover, the parameter setting of the considered
scheduling discipline has been analyzed, as described in Section 4.

In the analysis, we use the LBAP (Linear Bounded Arrival Processes) theory [5] as the
traffic characterization approach. Furthermore, we investigate the effects of
multiplexing the video traffic on the LBAP characterization, instead of taking into
account the simple sum of the traffic descriptors obtained for the single sources. By
means of simulation analysis we evaluate whether the multiplexing gain derived from the
characterization of the aggregated traffic affects the QoS parameters.

The rest of the paper is organized as follows. In Section 2 we present the simulation
scenario, while in Section 3 we describe the voice source model and the video and data
traffic taken into account in the simulation analysis. Section 4 describes the parameter
settings. In Section 5 the results are
discussed, while Section 6 summarizes the main results presented in the paper.



2 Simulation Scenario 

Our simulation scenario mainly reflects the topology of a DiffServ domain of an IP
network. The simulation scenario is implemented using OPNET Modeler vers.
6.0.L, a powerful CAMAD (Computer Aided Modeling and Design) tool used for
modeling communication systems and analyzing network performance. The
considered scenario is shown in Fig. 1.

Fig. 1. Simulation Scenario



The network model is represented by edge and core routers, each one having a
work-conserving scheduler that isolates the entering flows on the basis
of performance guarantees. The scheduling discipline is a Worst-Case Fair Weighted
Fair Queuing with the addition of a shaper, obtained using a Shaped Starting Potential
Fair Queuing (SSPFQ), denoted as WF²Q+ [6]. WF²Q+ is a GPS (Generalized
Processor Sharing) approximating service discipline with high fairness properties and
relatively low implementation complexity. Moreover, in order to simulate a single
DiffServ domain, we implement at the edge router the classifier and the marker
necessary to associate each packet with the selected PHB (Per Hop Behavior). Based on
this classification and marking, each packet receives the suitable forwarding treatment
from the core routers. The traffic sources taken into account in the simulation scenario
are as heterogeneous as possible, because we want to analyze the performance of a
real network, which must integrate the transport of video, voice and data traffic. Hence,
describing the scenario shown in Fig. 1 in more detail, each of the blocks named host 1,
4 and 7 contains 15 voice sources, while each of the blocks named host 3, 6 and 9
contains a video source. The remaining hosts contain a “data” module, which simulates
best-effort traffic.

The statistics we have collected concern the most significant QoS parameters of
real-time services, i.e. end-to-end delay and delay jitter, which are evaluated considering
the connections between the different sources and the destination node shown in Fig. 1.

3 Source Models

In the simulations, we adopt a model only for the voice sources, while for the other 
kinds of traffic we consider actual traffic data. 

The model used for the voice sources consists of an On-Off process, suggested by the
typical behavior of a voice source with VAD (Voice Activity Detection): it is active
or inactive depending on whether the talker is speaking or silent. Assuming that no
compression is applied to the voice signal, during active periods the source transmits at
the constant bit rate of v = 64 Kbps (this corresponds to a standard PCM codec with
VAD). In-depth analyses of this traffic source, reported in the literature, have shown
that the distributions of the lengths of the active and inactive periods can be approximated by an
exponential function [7], with mean values respectively equal to T_on = 350 msec and
T_off = 650 msec. The packet size is 64 bytes and, considering the bit rate and the header
overhead (40 bytes taking into account the RTP/UDP/IP header), the source generates
one packet every 3 msec.
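
The packet interval quoted above follows directly from the payload size and the codec rate. The short sketch below redoes that arithmetic and, as an additional illustration that is not part of the original analysis, estimates the long-run rate on the wire of one source from the mean On and Off periods. The function and constant names are ours.

```python
# Voice source parameters taken from the text above.
BIT_RATE_BPS = 64_000      # PCM rate during active periods
PACKET_BYTES = 64          # total packet size, headers included
HEADER_BYTES = 40          # RTP/UDP/IP overhead
T_ON_S = 0.350             # mean active period
T_OFF_S = 0.650            # mean silent period


def packet_interval_s(bit_rate_bps, packet_bytes, header_bytes):
    """Time the codec needs to fill the payload of one packet."""
    payload_bits = (packet_bytes - header_bytes) * 8
    return payload_bits / bit_rate_bps


def mean_wire_rate_bps(bit_rate_bps, packet_bytes, header_bytes, t_on, t_off):
    """Long-run average rate on the wire, headers included."""
    interval = packet_interval_s(bit_rate_bps, packet_bytes, header_bytes)
    on_rate = packet_bytes * 8 / interval      # rate while the talker is active
    activity = t_on / (t_on + t_off)           # fraction of time spent active
    return on_rate * activity


interval = packet_interval_s(BIT_RATE_BPS, PACKET_BYTES, HEADER_BYTES)
mean_rate = mean_wire_rate_bps(
    BIT_RATE_BPS, PACKET_BYTES, HEADER_BYTES, T_ON_S, T_OFF_S
)
```

With these values the 24-byte payload is filled in 3 msec, in agreement with the text, and the average rate of one source, headers included, is roughly 60 Kbps.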



Table 1. Statistical parameters of considered video sources 



Video flow     Mean rate (Mbps)   Peak rate (Mbps)
GOLDFINGER     0.584              5.87
ASTERIX        0.537              3.54
SIMPSONS       0.446              5.77



The traffic data used for the video sources are described in [8], where their
statistical analysis is also presented. They have been obtained by collecting the output of an
MPEG-1 encoder fed with different movie sequences, each half an hour long. Some
relevant statistical parameters of the considered traffic data, named Goldfinger,
Asterix and Simpsons, are summarized in Table 1.

The video packets are produced at application level by dividing the number of bytes
produced by the encoder in the frame period, T = 1/24 sec, into consecutive packets of
size equal to 1500 bytes (in this case an MTU, Maximum Transfer Unit, of 1500 bytes is
assumed). Moreover, in the simulation we consider the 40 bytes of overhead related
to the RTP/UDP/IP header, assuming that every packet transports 1460 bytes of the
traffic data produced by the encoder.

The time interval between the generation of consecutive packets in each frame period
has been considered deterministic and equal to T/N, where N is the number of packets
necessary to transport all the bytes produced by the encoder during a frame period. As
an example, Fig. 2 presents a case where 3200 bytes are necessary for the encoding of
a frame; at application level we divide the frame into three packets, which are sent with
a time interval equal to T/3 sec.
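
The packetization rule just described can be sketched as follows. The text does not say how the last packet of a frame is sized; here we assume it carries only the remaining bytes, without padding. The function name `packetize_frame` is ours.

```python
import math

MTU_BYTES = 1500
HEADER_BYTES = 40                          # RTP/UDP/IP header
PAYLOAD_BYTES = MTU_BYTES - HEADER_BYTES   # 1460 bytes of encoder data per packet
FRAME_PERIOD_S = 1 / 24                    # frame period T


def packetize_frame(frame_bytes, frame_start_s=0.0):
    """Return (send_time_s, packet_size_bytes) pairs for one encoded frame.

    The N packets of a frame are spaced deterministically by T/N.
    Assumption: the last packet is shorter rather than padded to the MTU.
    """
    if frame_bytes <= 0:
        return []
    n = math.ceil(frame_bytes / PAYLOAD_BYTES)
    spacing = FRAME_PERIOD_S / n
    packets = []
    remaining = frame_bytes
    for k in range(n):
        payload = min(PAYLOAD_BYTES, remaining)
        remaining -= payload
        packets.append((frame_start_s + k * spacing, payload + HEADER_BYTES))
    return packets


example = packetize_frame(3200)   # the 3200-byte frame of Fig. 2
```

For the 3200-byte frame of Fig. 2 this gives three packets spaced by T/3, the first two filled up to the MTU.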

The Best Effort sources have been obtained from the traffic data acquired at the
Faculty of Engineering of the University of Pisa. In particular, we consider the traffic
exchanged by the Faculty of Engineering with the external world (essentially other
University sites and the Internet) by means of an ATM network at 155 Mbps [9]. The
peak rate of the considered traffic is equal to 11 Mbps, while a mean rate of only 400
Kbps has been observed; the high peak-to-mean ratio is clear evidence of the high
burstiness of the data traffic. During the data acquisition both the arrival time and the
size of each packet have been recorded. Hence, in this case the packet generation
process used in the simulations is directly obtained from the traffic data.



Fig. 2. An example of packet fragmentation at application level (frame size = 3200 bytes)



4 Parameters Settings 



In order to set the scheduler parameters, we first characterize the traffic sources by
means of the LBAP approach. The traffic characterization of a single source is
given by the parameters $(b, \rho)$, where $b$ is the bucket size and $\rho$ the token
rate. The physical interpretation of the LBAP parameters can be understood
considering that the number of bytes produced by a single source in a time interval
$(0, \tau)$, $A(\tau)$, is upper bounded by (1).

$$A(\tau) \le b + \rho\,\tau \qquad \forall \tau > 0 \qquad (1)$$
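
As an illustration of how an LBAP curve can be obtained from a recorded trace, the sketch below computes, for a given token rate, the smallest bucket size for which the bound holds. It is not the procedure of [5]: it uses the common token-bucket reading in which the bound must hold over every time interval of the trace, and it treats each packet as an instantaneous arrival. Both are our assumptions, as are the function names.

```python
def min_bucket_size(arrivals, rho):
    """Smallest b such that the trace conforms to (b, rho).

    arrivals: iterable of (time_s, size) pairs sorted by time.
    rho: token rate, in size units per second.
    The value is the maximum backlog of a virtual queue drained at rate rho.
    """
    backlog = 0.0
    worst = 0.0
    last_t = None
    for t, size in arrivals:
        if last_t is not None:
            # Drain the virtual queue during the gap since the last arrival.
            backlog = max(0.0, backlog - rho * (t - last_t))
        backlog += size
        worst = max(worst, backlog)
        last_t = t
    return worst


def lbap_curve(arrivals, rates):
    """Bucket size as a function of token rate: list of (rho, b) pairs."""
    trace = list(arrivals)
    return [(rho, min_bucket_size(trace, rho)) for rho in rates]


# Toy trace, for illustration only: 1000 units every second.
toy_trace = [(0.0, 1000), (1.0, 1000), (2.0, 1000)]
toy_curve = lbap_curve(toy_trace, [0, 500, 1000])
```

The resulting curve is non-increasing in the token rate: a larger rate never requires a larger bucket.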



The results presented in [10] provide an upper bound for the end-to-end
delay experienced by the traffic when it traverses a network of Latency Rate schedulers,
such as the one considered in our simulation scenario (i.e. a network of WF²Q+
schedulers); the estimation of the WF²Q+ latency term is described in [11]. In particular, the
delay introduced by a single node on a packet belonging to the i-th flow, characterized
by the LBAP parameters $(b_i, \rho_i)$, is upper bounded by (2) (in the following we suppose
that a service rate equal to $\rho_i$ is allocated to the i-th flow).

$$D_i \le \frac{b_i}{\rho_i} + \Theta_i \qquad (2)$$

where $\Theta_i$ represents the latency of the scheduler, defined as
$\Theta_i = \frac{L_{i,\max}}{\rho_i} + \frac{L_{\max}}{C}$; $L_{i,\max}$ and $L_{\max}$ respectively represent the maximum packet
size of the i-th flow and of the global traffic arriving at the scheduler, while $\rho_i$ and $C$
are the service rate allocated to the i-th flow and the global output service rate,
respectively. Extending the analysis to a network of K WF²Q+ schedulers, the end-to-end
delay experienced by a single packet belonging to the i-th flow is upper
bounded by (3).

$$D_i \le \frac{b_i}{\rho_i} + \sum_{j=1}^{K} \Theta_i^{j} \qquad (3)$$

where $\Theta_i^{j}$ indicates the latency of the j-th node evaluated for the i-th flow.
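
The bounds (2) and (3) are simple enough to evaluate numerically. The sketch below does so for an illustrative case that is our own assumption, not a result of the paper: the aggregate of 15 voice sources (b = 30 Kbit, rate 1.10 Mbps, 64-byte packets) crossing three schedulers whose output rates are 3.06, 6.12 and 9.18 Mbps, with a maximum packet size of 1500 bytes at every node.

```python
def lr_latency_s(l_i_max_bits, rho_i_bps, l_max_bits, c_bps):
    """Latency term of one scheduler for flow i: L_i,max/rho_i + L_max/C."""
    return l_i_max_bits / rho_i_bps + l_max_bits / c_bps


def single_node_bound_s(b_i_bits, rho_i_bps, latency_s):
    """Bound (2): delay introduced by a single node."""
    return b_i_bits / rho_i_bps + latency_s


def end_to_end_bound_s(b_i_bits, rho_i_bps, latencies_s):
    """Bound (3): delay across a chain of schedulers."""
    return b_i_bits / rho_i_bps + sum(latencies_s)


# Illustrative values (assumptions, see the lead-in).
B_VOICE_BITS = 30_000
RHO_VOICE_BPS = 1.10e6
L_VOICE_BITS = 64 * 8
L_MAX_BITS = 1500 * 8
LINK_RATES_BPS = [3.06e6, 6.12e6, 9.18e6]

latencies = [
    lr_latency_s(L_VOICE_BITS, RHO_VOICE_BPS, L_MAX_BITS, c)
    for c in LINK_RATES_BPS
]
bound_with_latency = end_to_end_bound_s(B_VOICE_BITS, RHO_VOICE_BPS, latencies)
bound_without_latency = B_VOICE_BITS / RHO_VOICE_BPS
```

Under these assumptions the latency terms add a few milliseconds to the burst term b/ρ, which by itself is about 27 msec.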



The end-to-end delay bound has been obtained through a worst-case
analysis, which is conservative with respect to the end-to-end delay actually
experienced. Furthermore, as shown in [12], the LBAP traffic characterization is also
conservative with respect to statistical modeling approaches. These two
considerations lead us to assume that the maximum end-to-end delay of the i-th flow
can be upper bounded simply by (4).

$$D_i \le \frac{b_i}{\rho_i} \qquad (4)$$

This hypothesis allows us to establish the buffer size and the guaranteed rate to set
in the scheduler simply by evaluating the LBAP curve of the i-th flow.

The simulation results presented in Section 5 support
our assumption, showing the very conservative nature of the worst-case analysis.

Considering the above hypothesis, the procedure used to set the scheduler
parameters consists in evaluating the LBAP curve and finding the point where this
curve intersects the straight line $b_i = \rho_i D_i$, where $D_i$ represents the maximum delay
fixed for the considered source.

Fig. 3. Video characterization (curves: asterix, bond, simpsons, 3video; horizontal axis: token rate in bps)

Fig. 3 shows the presented approach in the case of the video sources. In particular,
in the figure we can observe the LBAP curves of three different video sources, and
the curve related to the traffic obtained by aggregating these sources. Moreover,
assuming a maximum end-to-end delay of 200 msec, we can observe the related
straight line and its intersection points with each LBAP curve, which, as described
above, give the pair $(\rho_i, b_i)$ to consider in the setting of the scheduler parameters.
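
The intersection between a characterization curve and the straight line b = ρD can be located numerically. The sketch below uses bisection and assumes that the curve is available as a function of the token rate and is non-increasing in it. The curve used in the example is synthetic, chosen only so that the result is easy to check by hand; it is not one of the curves of Fig. 3.

```python
def operating_point(bucket_of_rate, d_max_s, rho_lo, rho_hi, tol=1.0):
    """Find the smallest rate whose bucket size fits the delay target.

    bucket_of_rate: callable rho -> b, assumed non-increasing in rho.
    Returns (rho, b) with b = rho * d_max_s, located by bisection
    to within `tol` on the rate axis.
    """
    if bucket_of_rate(rho_hi) > rho_hi * d_max_s:
        raise ValueError("delay target not reachable in the given rate range")
    while rho_hi - rho_lo > tol:
        mid = (rho_lo + rho_hi) / 2
        if bucket_of_rate(mid) <= mid * d_max_s:
            rho_hi = mid      # target met: try a smaller rate
        else:
            rho_lo = mid      # bucket too large: a higher rate is needed
    return rho_hi, rho_hi * d_max_s


def synthetic_curve(rho_bps):
    """Hypothetical non-increasing curve, in bits."""
    return max(0.0, 500_000 - 0.2 * rho_bps)


rho, b = operating_point(synthetic_curve, d_max_s=0.2, rho_lo=0.0, rho_hi=6e6)
```

For the synthetic curve and a 200 msec target, the crossing is at 500000 - 0.2ρ = 0.2ρ, i.e. ρ = 1.25 Mbps and b = 250 Kbit.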

The characterization results obtained for the three considered video flows
(named Goldfinger, Asterix and Simpsons), the aggregate of 15 voice sources
(corresponding to a host in the simulation scenario) and the data traffic are reported in
Table 2. In the table, the column Dmax indicates the maximum end-to-end delay
analytically obtained from the estimated scheduler parameters.



Table 2. Traffic characterization of considered sources 



Traffic flow        Rate ρ (Mbps)   Buffer b (Kbit)   Dmax = b/ρ (msec)
GOLDFINGER          1.25            250               200
ASTERIX             0.83            160               193
SIMPSONS            1.27            260               205
15 VOICE sources    1.10            30                27
Data                0.40            400               1000



In the setting of the scheduler parameters, we can choose to set a service rate equal
to the sum of the $\rho_i$ and a buffer size equal to the sum of the $b_i$, where $\rho_i$ and $b_i$ are the
parameters referring to the i-th video source. Considering this approach and the
results obtained with the procedure presented above, the total service rate to allocate
to the video services and the related buffer size are equal to 3.35 Mbps and 670 Kbits,
respectively.
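
The totals quoted above, and the Dmax column of Table 2, can be checked with a few lines. The dictionary below simply restates the video rows of Table 2; the helper name `d_max_ms` is ours.

```python
# (rate in Mbps, buffer in Kbit) for the three video flows of Table 2.
VIDEO_FLOWS = {
    "GOLDFINGER": (1.25, 250),
    "ASTERIX": (0.83, 160),
    "SIMPSONS": (1.27, 260),
}


def d_max_ms(rate_mbps, buffer_kbit):
    """Delay bound b/rho; Kbit divided by Mbps gives milliseconds."""
    return buffer_kbit / rate_mbps


total_rate_mbps = sum(rate for rate, _ in VIDEO_FLOWS.values())
total_buffer_kbit = sum(buf for _, buf in VIDEO_FLOWS.values())
delays_ms = {name: d_max_ms(*params) for name, params in VIDEO_FLOWS.items()}
```

The sums reproduce the 3.35 Mbps and 670 Kbits of the text, and the three ratios round to the 200, 193 and 205 msec of Table 2.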



Table 3. Parameters of the routers 





Router             Output service rate (Mbps)   Buffer sizes, one per traffic class (Kbit)
Boundary routers   3.06                         30, 500, 250
Core router 1      6.12                         60, 1000, 500
Core router 2      9.18                         90, 1500, 750

The other approach consists in estimating the parameters directly from the
LBAP curve of the multiplexed traffic. In this case, a
multiplexing gain is expected in the resources to be guaranteed to the video service. In
particular, considering the same upper bound on the end-to-end delay, we need to
allocate a service rate of 2.5 Mbps and a buffer size of 500 Kbits. Hence, in terms of
buffer size we observe a gain of 170 Kbits (corresponding to a reduction of about
25%), while in terms of service rate a gain of 850 Kbps (25%) is achieved.
Considering the service rate evaluated for each kind of traffic source, it is possible to
set the scheduler parameters assuming a utilization factor of the link equal to 0.9. In
more detail, we evaluate the sum of the $\rho_i$, $\rho_{tot}$, and fix the rate of the output link equal to
$\rho_{tot}/0.9$. Hence, for boundary and core routers, the parameters are set as summarized
in Table 3.
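
The multiplexing gain and the link dimensioning rule can be reproduced as follows. The composition of the traffic entering a boundary router (one voice host, one video source with the Goldfinger parameters, one data source) is our assumption, made only to show that the rule yields a value close to the 3.06 Mbps of Table 3.

```python
def relative_gain(sum_value, multiplexed_value):
    """Fraction of resources saved by characterizing the aggregate."""
    return (sum_value - multiplexed_value) / sum_value


def link_rate_mbps(class_rates_mbps, utilization=0.9):
    """Output link rate: sum of the allocated rates divided by the utilization."""
    return sum(class_rates_mbps) / utilization


rate_gain = relative_gain(3.35, 2.5)      # service rate, Mbps
buffer_gain = relative_gain(670, 500)     # buffer size, Kbit

# Assumed per-class rates at a boundary router: voice, video, data (Mbps).
boundary_rate = link_rate_mbps([1.10, 1.25, 0.40])
```

Both gains come out at about 25%, and the assumed boundary router composition gives an output rate of about 3.06 Mbps.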

The buffer size is given for each traffic class, i.e. voice, video and data. When 
considering the scenario related to the aggregation of voice and video flows, the 
buffer size for this aggregated class of traffic has been set equal to 



5 Simulation Results 

The simulation analysis is mainly focused on the evaluation of the impact of different
aggregation strategies on the QoS parameters. The low scalability of the IntServ
network architecture suggests studying a DiffServ core
network architecture for the IP Telephony scenario. Hence, it is expected that in each node of an IP Telephony core
network per-class queueing is implemented and an appropriate scheduling
algorithm guarantees the QoS requested by the real-time sources.

Two relevant problems arise in this framework. The first concerns the choice of an
appropriate scheduling algorithm and of a procedure for setting its
parameters. The second is related to the aggregation strategies to adopt.
In this Section we present results that give insights into these problems.

We consider two different aggregation strategies: in the first one we carry video
and voice together in a premium class and data in a best-effort class (in the figures the
related curves are indicated with a label containing the words “1 link”); in the second one
we carry video traffic in a separate queue from voice traffic (curves
with a label containing the words “2 link”).

Fig. 4 presents the complementary probability of the end-to-end delay experienced
by the voice packets when the two different strategies are considered and only one
video source is active (in this case the simulation parameters have been
changed according to the dimensioning procedure presented in the previous
section, in order to take into account the inactivity of the other video sources).
Fig. 5 presents the same curves obtained when all three video sources are active.

As a first analysis of the simulation results, it is possible to note that the adopted
scheduling algorithm, i.e. WF²Q+, and the procedure for setting its parameters
guarantee the target QoS for the voice sources if the video and voice traffic flows
are not merged in a single queue. In particular, the curves labeled “1 Link” both in
Fig. 4 and in Fig. 5 clearly show that the maximum delay observed during the
simulation is under 10 msec, which is lower than the delay of 27 msec
considered in the setting of the scheduler parameters.

Fig. 4. Complementary probability of voice delay (One active video source)

Fig. 5. Complementary probability of voice delay (Three active video sources)

Both Fig. 4 and Fig. 5 show the degradation of the end-to-end delay of voice
traffic when the video flows are merged in the same queue with the voice traffic.
Indeed, in Fig. 4 we can note that, at a probability P = 0.001, a
delay of 7 msec is observed in the first case (curve “1 Link”), which is lower than the
15 msec registered in the second case.
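
Reading a delay at a given probability level from a complementary probability curve amounts to the computation sketched below, applied to the delay samples collected during a simulation run. The sample values used here are hypothetical and serve only to exercise the functions; they are not simulation results.

```python
def complementary_probability(samples, threshold):
    """Empirical P(delay > threshold)."""
    if not samples:
        raise ValueError("no samples")
    return sum(1 for s in samples if s > threshold) / len(samples)


def delay_at_probability(samples, p):
    """Smallest observed delay d such that P(delay > d) <= p."""
    for d in sorted(set(samples)):
        if complementary_probability(samples, d) <= p:
            return d
    raise ValueError("no samples")


# Hypothetical delay samples in msec.
toy_delays_ms = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
tail = complementary_probability(toy_delays_ms, 7)
d_30 = delay_at_probability(toy_delays_ms, 0.3)
```

The quadratic scan is adequate for a sketch; with long simulation traces a sorted array and a single pass would be preferable.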

Furthermore, this degradation is amplified when the number of video sources is
increased. Indeed, in Fig. 5 the maximum end-to-end delay registered for voice traffic
is unchanged with respect to the previous case, i.e. about 7 msec, while it increases to
23 msec when all real-time flows are aggregated in the same queue.

Hence, in this second case the worsening of the delay of the voice service is
due to the increased number of bursty traffic sources multiplexed with the voice sources.
We can therefore suppose that, when more bandwidth is requested, the need for network
resources increases in a non-linear way under a wrong aggregation
strategy; in this case the real-time application may be seriously damaged.

On the other hand, the video performance benefits from the aggregation, showing
a small delay improvement that can be related to the multiplexing with the voice (the
related figures are not reported for the sake of brevity).

The different performance observed with the two considered aggregation strategies
can be related to the “lock-out” phenomenon, which plays a decisive role in deteriorating
voice performance. Indeed, when the video sources are merged with the voice traffic,
the former monopolize the queue space, preventing the latter from receiving the
desired service level. The “lock-out” phenomenon can be avoided by using different
queues for video and voice traffic, while, at the same time, the choice of appropriate
scheduling disciplines can guarantee an adequate multiplexing gain.

Fig. 6. Complementary probability of voice delay jitter (Three active video sources)



Finally, we observed that the best-effort traffic (used as background traffic) is not
affected by the merging of the two service classes, because the total bandwidth share
(video + voice) is unchanged.

The same worsening of performance can be observed when we consider the delay
jitter, as shown in Fig. 6, which plots the complementary probability
estimated for this statistic (the results refer to the multiplexing of three video
sources).



6 Conclusion 

The main goal of the paper is the evaluation of different traffic aggregation strategies
for real-time voice and video services in a DiffServ environment. In this framework,
the simulation analysis presented in the paper highlights that the wrong aggregation of
traffic flows with different statistical features, such as video and voice traffic, may
lead to a performance worsening, which should be avoided especially when providing
IP-based business services, such as IP Telephony.

On the other hand, the simulation results emphasize that, with an adequate isolation
between video and voice traffic flows and an appropriate dimensioning of network
resources, it is possible to provide real-time services. In particular, the considered
network of WF²Q+ schedulers and the proposed procedure for setting the related
parameters achieve the target QoS. Furthermore, analyzing the proposed
procedure for setting the scheduler parameters, based on the LBAP traffic
characterization, it has been possible to highlight the multiplexing gain obtainable
by considering the LBAP characterization of the aggregated traffic.

Finally, the simulation results have shown that, although we have neglected the
latency terms in the expression of the end-to-end delay reported in the literature, the
maximum delay experienced is lower than that analytically estimated (see the fourth
column in Table 2). This result is further evidence of the very conservative nature of
the worst-case analysis.



Abstract. Considering that current end-to-end communication services are not
suited to efficiently supporting distributed multimedia applications, this paper
introduces a new family of generic transport protocols directly instantiated
from application layer quality of service requirements. This Generic Transport
Protocol (GTP) has been successfully tested for video on demand systems and
is one of the major building blocks of the GCAP European project, currently
under development. GTP allows powerful mechanisms for adaptation to the network
behavior to be applied in the transport layer while preserving application
requirements and alleviating network bandwidth and buffering needs.



1 Introduction 

There is a clear need for a new generation transport protocol layer that could provide an
efficient adaptation between application layer needs and network behaviour, capabilities
and resources. The management of transport connections could be greatly enhanced
by informing the transport layer of the reliability, ordering, synchronisation
and temporal constraints associated with Application Data Units. Indeed, multimedia
applications (e.g. video on demand, web browsing, access to MPEG4 or SMIL documents...)
do not need the full reliability and total ordering enforced by TCP:
these applications have partial order, partial reliability and specific synchronization
constraints. Therefore, the use of TCP for multimedia applications induces a service
that is not only unneeded by the transport service user but, above all, can potentially
seriously disrupt the semantics of media streams. This reason promoted UDP as
the privileged transport layer for accessing multimedia streams. However, such a
solution obliges each application to include complex mechanisms for enforcing
application specific data ordering, synchronisation constraints and loss control. Dedicated
application layer protocols such as RTP and RTCP do not greatly alleviate the
load and complexity of these network aware applications, which have to directly adapt
their behaviour to the QoS delivered by the network. Therefore, neither UDP nor TCP
is able to offer an efficient service in conformance with the great diversity of application
needs. To ensure an efficient mapping between application needs and network
behaviour and services, the transport protocol must be aware of the specific ordering,
loss and synchronisation constraints related to application data units. Such a generic
transport protocol entails an application layer framing approach, which consists in
defining, at the application layer, self-dependent data units which are also considered
by the underlying communication layers as transport data units and network data units
[1,2]. If the size of these data units is lower than the size of the maximum transfer unit,
such an approach avoids costly fragmentation of data units while allowing the transport
protocol to take advantage of the independence of data units.

TCP and UDP should be two specific instantiations of the considered generic transport
protocol, which should be able to deliver a continuum of transport services between
these two extremes. Unlike traditional application layer framing approaches,
which put all the burden on the application layer and oblige one to reinvent the
wheel for each application, a generic transport layer has to be designed only once and
can be dynamically instantiated to be adapted to specific application needs. In this
paper we introduce such a generic transport protocol, coupled with a simple and direct
technique for deriving a transport layer service from application layer QoS requirements.

In the first part of this paper we briefly introduce a formal technique for modelling
multimedia components. Then we demonstrate that this formal approach offers a simple
and efficient solution for mapping application layer QoS parameters down to
transport layer QoS parameters. This formal approach then leads us to introduce a new
family of Generic Transport Protocols (GTP) that can be directly instantiated from the
formal expression of application layer QoS requirements. In the last part of this paper
we describe a platform independent Java implementation of GTP designed in the
framework of the GCAP European Project. Finally, two elementary experiments show
that, when accessing discrete or continuous media, this new generation of transport
protocols delivers a service more compliant with the application needs and more efficient
than the one offered by UDP or TCP.



2 From Application Layer QoS Requirements to Transport Layer Service

Ideally, a transport protocol should realize efficient adaptations between the great
diversity of application needs and the network behavior. Current transport protocols
either, like UDP, deliberately ignore the application needs and the network behavior,
or, like TCP, adapt their behavior to network conditions but always deliver a predefined
service that ignores specific application needs. To realize an effective and efficient
adaptation between the network and the application, the transport layer must be informed
by the application of its needs. This customization of the transport service can
be done when the application creates a transport service access point (e.g. at socket
creation).

We propose a three-step approach for using such a generic transport protocol, which can
apply efficient adaptation decisions between application layer QoS needs and network
behavior and services:

1. Definition of application layer QoS needs based on a formal model that allows the
consistency of the application requirements to be checked.

2. Formal derivation of a transport service from the application layer requirements.

3. Instantiation of the generic transport protocol with the previously obtained transport
service.




Fig. 1. Formal modeling of multimedia components. (a) A SMIL document. (b) The translation
of the SMIL document into an HTSPN specification



We have previously shown that the design of complex and large scale distributed
hypermedia applications can be greatly enhanced with the help of a formal model that
allows the fundamental features of these applications to be specified and their properties
to be checked [3,4]. Our approach is based on a temporal extension of Petri Nets,
called Hierarchical Time Stream Petri Nets (HTSPN), that allows one to express
simultaneously, with the same formal techniques, the reliability, ordering and temporal
constraints associated with a multimedia or hypermedia application. Hence, this formal
model allows the most fundamental QoS requirements of multimedia applications to
be abstractly expressed. Moreover, taking into account the asynchronous behavior of
current network services, this model allows one to specify the admissible temporal
variability of multimedia components. This specification
is done with the help of a triple (x, n, y) called a
Temporal Validity Interval (TVI), where x, n and y specify respectively the minimal,
nominal and maximal admissible durations of the component. This model introduces a
complete set of synchronization operators which define a formal semantics of
synchronization for asynchronous or weakly-synchronous systems [5]. This formal
semantics suppresses synchronization non-determinism while offering scheduling flexibility
for information access, delivery and presentation.
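
A Temporal Validity Interval can be represented by a very small data structure. The sketch below is only an illustration of the triple (x, n, y): the class name, the field names and the `admits` helper are ours and are not part of the HTSPN model or of the GTP implementation. The ordering check x <= n <= y is our reading of "minimal, nominal and maximal".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporalValidityInterval:
    """Triple (x, n, y): minimal, nominal and maximal admissible durations."""

    minimal: float
    nominal: float
    maximal: float

    def __post_init__(self):
        # Assumption: a well-formed TVI satisfies 0 <= x <= n <= y.
        if not (0 <= self.minimal <= self.nominal <= self.maximal):
            raise ValueError("expected 0 <= x <= n <= y")

    def admits(self, duration):
        """True if a presentation of this duration is temporally valid."""
        return self.minimal <= duration <= self.maximal


# Hypothetical audio component: nominally 10 s, tolerated between 8 s and 12 s.
audio_tvi = TemporalValidityInterval(8.0, 10.0, 12.0)
```

A component whose actual presentation lasts 9 s would be accepted by `audio_tvi.admits(9.0)`, while one lasting 13 s would not.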

The modeling power of HTSPNs allows the fundamental QoS requirements of advanced
hypermedia components, as defined by the forthcoming SMIL 2.0 standard,
to be abstracted and formally expressed (Figure 1).

In summary, the HTSPN model allows not only the ordering requirements of the application
to be specified through recursive sequential and parallel composition of media
elements, but also the reliability and temporal requirements to be expressed with the
help of powerful temporal synchronization rules. For instance, the specification given
in Figure 1-b states that the audio is the master stream of the inter-stream synchronization
scheme between the audio stream and the image stream (this is graphically expressed
with the help of a bold arrow). This specification is done with the help of a
master synchronization transition which states that:

• The audio stream has to be fully presented to the user 

• The image stream may be partially presented 



3 Deriving a Transport Service from Application Requirements 

3.1 The Order and Reliability Dimensions 

In the previous section we have seen that the HTSPN model allows one to express
three fundamental QoS features of multimedia applications:

1. Ordering constraints between the various application data units or components

2. Reliability constraints

3. Time constraints

Because order and reliability constraints are intrinsic to a wide variety of distributed
applications, this consideration has led to the definition of two specific transport
protocols widely used in the Internet, TCP and UDP, which deliver respectively a
fully ordered, fully reliable service and an unordered, unreliable service. Moreover,
neither of these two protocols takes into account application layer temporal constraints.
However, as exemplified by the SMIL component in Figure 1, multimedia documents
or components need neither a fully reliable and ordered service nor an unordered and
unreliable service. For instance, the component modeled in Figure 1 can tolerate the
partial or total loss of image I, and the audio and image I can be delivered in any
order (i.e. as soon as the transport layer receives the image or the audio component, it can
be delivered to the service user). This statement, coupled with the gain obtained from
ensuring the management of partial order and reliability constraints at the transport
level, leads to a new family of transport protocols which make the most of the application
requirements to deliver an optimal service in terms of end-to-end delay and
buffering and bandwidth needs. This new family of “application aware transport protocols”
delivers a connection oriented transport service defined from the ordering,
reliability and time constraints given by the application when opening a multimedia
connection.

Such a generic transport protocol raise the question of a method for deriving the 
transport layer service from application layer QoS requirements. This mapping be- 
tween the application layer specification and the transport service can be immediately 
obtained for the reliability and ordering constraints. Indeed the HTSPN specification 
defines intrinsically a partial order that is defined as follow. 

Definition 1. Let A={ai, ,an}, with I=(l,...n), a set of ADUs associated to an appli- 

cation layer QoS specification specified by a HTSPN H. The partial order specifica- 
tion of the transport service related to H is given by the set 0={ (aOigp^i) ,...., (a,)iE pk(i) 
} where P={pl,...pk} is a set of permutations on I that defines the partial order on the 
elements of A directly derived from H. 

In other words, a partially ordered transport service can deliver any sequence of 
ADUs that conforms with the ordering presentation requirements of the application. 
For example, the partial order constraints of the transport service derived from the 
HTSPN specification in Figure 1 is given by the set 0={(Vi,I,A,V2),(Vi,A,I,V2)}. 

Definition 2. A transport service s that delivers its Transport Service Data Units
$A = \{a_1, \ldots, a_n\}$ following the order defined by $o = (a_i)_{i \in p(I)}$ is ordering conformant
with the transport service specification S defined by P if and only if $p \in P$.

For instance, a transport service that delivers $o = (V_1, I, A, V_2)$ is ordering conformant
with the service specification S defined by $O = \{(V_1, I, A, V_2), (V_1, A, I, V_2)\}$.
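
As an illustration, the ordering conformance test of Definitions 1 and 2 can be sketched in a few lines of Java. This is only a sketch under stated assumptions: the class and method names are hypothetical and are not part of GTP, ADUs are represented by their names, and the set O is stored explicitly as a set of admissible sequences. The second method is an additional assumption of the sketch: it tests whether an ADU may be handed to the user immediately, by checking that the extended sequence is still a prefix of an admissible one.

```java
import java.util.List;
import java.util.Set;

/**
 * Illustrative sketch of Definitions 1 and 2 (hypothetical names, not GTP code).
 * The set O is represented as a set of admissible ADU sequences.
 */
final class OrderingConformance {

    /** Definition 2: a complete delivered sequence conforms if it is one of the sequences in O. */
    static boolean isOrderingConformant(Set<List<String>> admissible, List<String> delivered) {
        return admissible.contains(delivered);
    }

    /**
     * Returns true if 'next' may be delivered right after 'deliveredSoFar',
     * i.e. if the extended sequence is still a prefix of some admissible sequence.
     */
    static boolean canDeliverNext(Set<List<String>> admissible,
                                  List<String> deliveredSoFar, String next) {
        int n = deliveredSoFar.size();
        for (List<String> sequence : admissible) {
            if (sequence.size() > n
                    && sequence.subList(0, n).equals(deliveredSoFar)
                    && sequence.get(n).equals(next)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // O for the component of Figure 1: I and A may be delivered in either order.
        Set<List<String>> o = Set.of(
                List.of("V1", "I", "A", "V2"),
                List.of("V1", "A", "I", "V2"));

        System.out.println(isOrderingConformant(o, List.of("V1", "I", "A", "V2")));
        System.out.println(canDeliverNext(o, List.of("V1"), "A"));   // A may follow V1
        System.out.println(canDeliverNext(o, List.of("V1"), "V2"));  // V2 may not follow V1 directly
    }
}
```

Storing O explicitly is practical only for small components such as the one in Figure 1; a real protocol entity would more plausibly work from the partial order itself.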

As seen previously, the synchronization operators introduced by the HTSPN model
make it possible to distinguish between mandatory application data units and those
whose presentation can be partially or totally skipped. More generally, the HTSPN
model allows a deterministic or probabilistic specification of admissible losses to be
expressed. The following definition considers only the deterministic point of view.
The partially reliable transport service associated with an application layer QoS
requirement given by an HTSPN H is defined as follows:

Definition 3. Let $A = \{a_1, \ldots, a_n\}$, with $I = (1, \ldots, n)$, be a set of ADUs associated
with an application layer QoS specification given by an HTSPN H. This HTSPN defines the set
$R \subseteq A$ of ADUs which must be processed by the application. The specification, S, of
the partially reliable transport service related to this application layer QoS is also
given by the set R which, in this case, defines the set of ADUs that must be delivered
to the application by the transport service.

For instance, the partially reliable transport service adapted to the multimedia component
modeled in Figure 1 is defined by the set $R = \{V_1, A, V_2\}$ of the ADUs that must
be delivered to the transport service user.

Definition 4. A transport service that delivers a set r of ADUs offers a reliability in
conformance with a partially reliable transport service specification defined by the set
R of mandatory ADUs if and only if $R \subseteq r$.
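
The reliability conformance test of Definition 4 reduces to a set inclusion, which the following minimal Java sketch makes explicit. The class and method names are hypothetical and ADUs are again represented by their names; the values are those given for Figure 1.

```java
import java.util.Set;

/** Illustrative sketch of Definition 4 (hypothetical names, not GTP code). */
final class ReliabilityConformance {

    /** The delivered set r conforms if it contains every mandatory ADU of R. */
    static boolean isReliabilityConformant(Set<String> mandatory, Set<String> delivered) {
        return delivered.containsAll(mandatory);
    }

    public static void main(String[] args) {
        Set<String> r = Set.of("V1", "A", "V2"); // mandatory set R for Figure 1

        // Image I was dropped: all mandatory ADUs are still present.
        System.out.println(isReliabilityConformant(r, Set.of("V1", "A", "V2")));
        // Audio A is missing: a mandatory ADU was not delivered.
        System.out.println(isReliabilityConformant(r, Set.of("V1", "I", "V2")));
    }
}
```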

By combining definitions 1 and 3 we get the notion of a partially ordered and reliable
transport service. Such a service, which delivers its TSDUs in conformance with both
the ordering and reliability constraints defined for the processing of ADUs (i.e. the
application layer processing schedule defined by the formal specification is also
taken as the logical access schedule to the transport service), offers the following
advantages:

• The application does not need to buffer its ADUs for reordering purposes.

• The application does not have to manage reliability constraints.

• The receiving transport entity delivers its TSDU as soon as possible in confor- 
mance with the application needs. 

• Compared to a fully reliable and ordered service, the knowledge of partial ordering 
and reliability requirements for ADU delivery allows the buffering needs of the re- 
ceiving and sending transport entities to be reduced. 

• Unnecessary retransmission of non mandatory ADUs can be avoided. 

• Reliability and ordering constraints introduce some flexibility in the ADU transmission
schedule. Depending on the monitored network QoS, the sending
transport entities can therefore apply informed filtering or reordering decisions.

Of course, such an application aware transport service has a significant impact on the
fundamental transport layer mechanisms such as error control, rate control and congestion
control.



3.2 The Time Dimension 

Before introducing time in transport services and protocols, we first need to understand
the meaning of “timed transport service”. A “timed transport service” can be defined
as a service which delivers its service data units to the service user on time. For a
multimedia component, composed of several discrete or continuous media, an applica- 
tion data unit must be immediately available when the previous ones have been dis- 
played. For instance, the audio-visual sequence of the multimedia component in Fig- 
ure 1 (i.e. the inter-stream synchronization scheme between A and I) can finish at any 
time within the relative interval (i.e. from the beginning of the audio-visual sequence) 
[11,13]. 

This time interval is obtained from the synchronization semantics of transition $t_2$
and from the temporal validity interval of the audio stream, which is the master of this
inter-stream synchronization point. Considering that this audio-visual sequence can be
displayed for between 11 and 13 time units, the transport protocol deduces that video “$V_2$”
can be delivered to the service user at any time during this relative time interval. Such
a definition of a “timed transport service” entails that the service user (i.e. the player
of the multimedia component) can adapt the rate of its presentation to the transport
service delivery. This limited and accepted adaptive behavior of the application helps
the transport protocol to “hide” the variations of the quality of service delivered by the
network. Moreover, on-time delivery of data to the transport service user spares the
application from scheduling access to remote data, from using time-stamping protocols such
as RTP, and from using buffering techniques to reduce network jitter and skew. When
used on top of a best effort network service, this temporal transport service strengthens
the isolation between the application and the network and allows informed adaptation and
control decisions to be applied consistently with application needs.




Fig. 2. Formal specification of the transport service adapted to the multimedia component
modeled in Figure 1. This transport service specification supposes that the service user accepts
a control time of 2 time units before playing the component.

Once again, such an approach raises the question of how to define a temporal transport
service adapted to the needs of a given application. The derivation of the timed
transport service from the application layer time constraints expressed by an HTSPN
model is less direct than for the reliability or ordering constraints. This
derivation uses the following procedure.

Definition 5. Let us consider an application layer QoS specification modelled by an
HTSPN, H. For each ADU, a, in H let TVI(a) denote the Temporal Validity Interval
associated with a, and pred(a) the abstract place (as defined in [6]) of H that represents
the immediate predecessor of a. The specification of the temporal transport service
adapted to H is given by the HTSPN specification H’ derived from H as follows:

• $\forall a \in A$ such that $\mathrm{pred}(a) \neq \emptyset$: $\mathrm{TVI}(a) = \mathrm{TVI}(\mathrm{pred}(a))$

• $\forall a \in A$ such that $\mathrm{pred}(a) = \emptyset$: $\mathrm{TVI}(a) = (x, n, y)$, where x, n and y define respectively the
minimum, nominal and maximum admissible waiting time for the delivery of the
first ADU to the transport service user (i.e. the control time that the user accepts
for the streaming of its multimedia component)

Definition 6. A transport service that delivers a set $A = \{a_1, \ldots, a_n\}$ of ADUs by following
the delivery schedule $T = (t(a_1), \ldots, t(a_n))$, where $t(a_i)$ is the delivery time of $a_i$,
is time conformant with a transport service specification modeled by an HTSPN H’
derived from H, if and only if: $\forall a_i \in A$ with $\mathrm{TVI}(a_i) = (x_i, n_i, y_i)$ in H’,
$t(a_i) \in [t(\mathrm{pred}(a_i)) + x_i,\ t(\mathrm{pred}(a_i)) + y_i]$.
That is, every ADU is delivered in a time window that takes into account
the admissible temporal variability of the previous ADUs.
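
The time conformance test of Definition 6 can be sketched as follows. The sketch rests on several assumptions that are not in the text: each ADU has at most one immediate predecessor, the first ADU is measured from a caller-supplied start time (for instance the opening of the connection), an undelivered ADU is left to the reliability test, and an ADU whose predecessor has no recorded delivery time is treated as non-conformant. All names and the numeric values in `main` are hypothetical.

```java
import java.util.List;
import java.util.Map;

/** Illustrative sketch of Definition 6 (hypothetical names, not GTP code). */
final class TimeConformance {

    /** One ADU of the derived specification H': its predecessor and its delivery window. */
    static final class TimedAdu {
        final String name;
        final String pred;  // name of the immediate predecessor, or null for the first ADU
        final double min;   // x: minimum admissible waiting time
        final double max;   // y: maximum admissible waiting time

        TimedAdu(String name, String pred, double min, double max) {
            this.name = name;
            this.pred = pred;
            this.min = min;
            this.max = max;
        }
    }

    /** Checks t(a) in [t(pred(a)) + x, t(pred(a)) + y] for every delivered ADU. */
    static boolean isTimeConformant(List<TimedAdu> spec,
                                    Map<String, Double> deliveryTimes,
                                    double startTime) {
        for (TimedAdu adu : spec) {
            Double t = deliveryTimes.get(adu.name);
            if (t == null) {
                continue; // not delivered: a reliability question, checked separately
            }
            double reference = startTime;
            if (adu.pred != null) {
                Double predTime = deliveryTimes.get(adu.pred);
                if (predTime == null) {
                    return false; // assumption of this sketch: missing predecessor time
                }
                reference = predTime;
            }
            if (t < reference + adu.min || t > reference + adu.max) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical two-ADU chain: a1 within [0, 2] of the start, a2 within [3, 5] of a1.
        List<TimedAdu> spec = List.of(
                new TimedAdu("a1", null, 0, 2),
                new TimedAdu("a2", "a1", 3, 5));

        System.out.println(isTimeConformant(spec, Map.of("a1", 1.0, "a2", 5.0), 0.0));
        System.out.println(isTimeConformant(spec, Map.of("a1", 1.0, "a2", 7.0), 0.0));
    }
}
```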

Figure 2 gives the formal modeling of the timed transport service specification associ- 
ated with the multimedia component modeled in Figure 1. This modeling introduces a 
control time (a parameter given by the application when opening the transport connec- 
tion) which specifies the maximum duration accepted by the transport service user 
before beginning to play the component. Note that the transport service specification 
can be automatically and simply derived by the transport protocol from the HTSPN 
that models the application requirements. 



3.3 The Space of Temporal Partially Reliable and Ordered Protocols 

A multimedia component, as defined in section 2, induces a timed partially ordered
and reliable transport service which, in turn, defines a sub-space within the space of
the whole family of timed partially ordered and reliable protocols that could be used
for transporting the set of application data units which compose this multimedia
component (Figure 3) [7].

In this space, a Timed Partially Ordered and reliable transport service and Connection
(TPOC) associated with a set of application data units $A = \{a_1, \ldots, a_n\}$ is uniquely defined
by a 3-tuple $S = (O, R, T)$ where:

• O is the set of admissible sequences extracted from A,

• R is the subset of A composed of the elements which must be delivered to the
service user,

• T (also written TVI) is the set of temporal validity intervals associated with the transport service
data units delivered by the service S.

For instance, for the multimedia component in Figure 1, we have $A = \{V_1, A, I, V_2\}$, and
$S = (O, R, T)$ with $O = \{(V_1, A, I, V_2), (V_1, I, A, V_2)\}$, $R = \{V_1, A, V_2\}$ and
$TVI = \{(0,1,2), (12,15,20), (12,15,20), (11,12,13)\}$.
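
To make the 3-tuple concrete, the Figure 1 example can be written down as a small Java data structure. This is a sketch under stated assumptions: the names are hypothetical, the temporal validity intervals are attached to the ADUs in the order in which A is listed, and the combined test reflects one possible reading of "partially ordered and reliable": a delivery is accepted if it contains every element of R and is an order-preserving subsequence of some sequence of O, so that non-mandatory ADUs may be dropped.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative container for a TPOC specification S = (O, R, TVI); hypothetical names. */
final class TpocSpec {
    final Set<List<String>> admissibleSequences; // O
    final Set<String> mandatory;                 // R
    final Map<String, int[]> tvi;                // (min, nominal, max) per ADU

    TpocSpec(Set<List<String>> admissibleSequences, Set<String> mandatory,
             Map<String, int[]> tvi) {
        this.admissibleSequences = admissibleSequences;
        this.mandatory = mandatory;
        this.tvi = tvi;
    }

    /** The specification of the Figure 1 component, with values taken from the text. */
    static TpocSpec figure1() {
        return new TpocSpec(
                Set.of(List.of("V1", "A", "I", "V2"), List.of("V1", "I", "A", "V2")),
                Set.of("V1", "A", "V2"),
                Map.of("V1", new int[] {0, 1, 2},
                       "A", new int[] {12, 15, 20},
                       "I", new int[] {12, 15, 20},
                       "V2", new int[] {11, 12, 13}));
    }

    /** Accepts a delivery that holds all of R and preserves the order of some sequence in O. */
    boolean acceptsDelivery(List<String> delivered) {
        if (!delivered.containsAll(mandatory)) {
            return false;
        }
        for (List<String> sequence : admissibleSequences) {
            if (isSubsequence(delivered, sequence)) {
                return true;
            }
        }
        return false;
    }

    /** True if 'candidate' can be obtained from 'full' by deleting elements only. */
    static boolean isSubsequence(List<String> candidate, List<String> full) {
        int matched = 0;
        for (String element : full) {
            if (matched < candidate.size() && candidate.get(matched).equals(element)) {
                matched++;
            }
        }
        return matched == candidate.size();
    }

    public static void main(String[] args) {
        TpocSpec spec = figure1();
        System.out.println(spec.acceptsDelivery(List.of("V1", "A", "V2")));  // I dropped
        System.out.println(spec.acceptsDelivery(List.of("V1", "I", "V2")));  // A missing
    }
}
```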

Fig. 3. The space of timed partially reliable and ordered transport services



4 Time in Transport Protocols

In the framework of best-effort networks, the transport layer has a fundamental role to
play in adapting the network service to application needs. Considering that there is a
gap between the temporal requirements of multimedia applications and the asynchronous
behavior of current networks such as the Internet (i.e. the IP network best effort service),
the transport layer is a privileged place where time-related QoS parameters can
be controlled and enforced. This approach aims to relieve multimedia applications of
the burden of implementing sophisticated buffering and adaptive techniques. The
design effort of multimedia applications can therefore be greatly reduced by the use of
a weakly synchronous transport service (i.e. a TPOC service) which delivers multimedia
information units according to time-related QoS parameters derived from application
level requirements. Such a new generation of transport protocols not only reduces
the complexity of distributed multimedia applications, but also entails a dramatic
improvement in the use of network and communication resources. Indeed, by taking
the temporal semantics of information units into consideration at the transport level,
this new approach allows more efficient congestion control, rate control, error control
and buffer management techniques to be applied. It is well known that the
TCP congestion control technique is not suited to the transport of multimedia
streams. The slow start and congestion avoidance mechanisms take into
account only the QoS delivered by the network, without considering the semantics of the
transport service data units. With such congestion control mechanisms,
variations in network QoS therefore impact directly and blindly on the QoS delivered to the
transport service user and are instantaneously perceived by the user. Such a behavior
is not acceptable for continuous media delivery, whose semantics depend strongly
on time constraints. Our approach allows the transport protocol to react to
congestion situations while taking the application requirements into account. This can
be done by using the partial reliability and temporal flexibility offered by the concept
of TPOC. Indeed, depending on the QoS delivered by the network, the TPOC can
reduce its delivery rate and partially or totally suppress the sending of application data
units which can be lost (i.e. the complement of the set R in A).
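
The sender-side reaction described here, suppressing the ADUs that are allowed to be lost when the network degrades, can be sketched as a simple filter. This is an illustrative sketch only: the names are hypothetical, congestion is reduced to a boolean supplied by the caller, and the total suppression of non-mandatory ADUs is chosen for brevity, whereas the text also allows partial suppression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative sender-side filter (hypothetical names, not GTP code). */
final class CongestionFilter {

    /**
     * Returns the ADUs to send, in their original order. Under congestion only the
     * mandatory ADUs (the set R) are kept; otherwise every pending ADU is sent.
     */
    static List<String> selectForSending(List<String> pending, Set<String> mandatory,
                                         boolean congested) {
        List<String> selected = new ArrayList<>();
        for (String adu : pending) {
            if (!congested || mandatory.contains(adu)) {
                selected.add(adu);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> pending = List.of("V1", "A", "I", "V2");
        Set<String> r = Set.of("V1", "A", "V2");

        System.out.println(selectForSending(pending, r, false)); // nothing suppressed
        System.out.println(selectForSending(pending, r, true));  // image I suppressed
    }
}
```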

Equally, window-based rate control mechanisms can be efficiently replaced by rate
control mechanisms which take into account the temporal semantics of the transported
application data units. Indeed, the sending rate can be adjusted dynamically, depending
on the state of the receiving entity buffer and consistently with the admissible rate
variations supported by the transport service. Note that partial reliability and order
can also be used for rate control purposes.

In summary, knowledge of the temporal semantics of the TSDUs at the transport layer
potentially offers the following advantages:

• Retransmissions can be managed more efficiently.

• Flow control techniques compatible with the application’s constraints can be applied.

• Congestion control techniques can exploit the flexibility offered
by the weakly synchronous time constraints on ADUs.

• The temporal flexibility of ADUs offers multiplexing capabilities and allows
network resources to be used more efficiently.

• Associating a temporal duration with ADUs reduces buffering needs
at the receiver side (i.e. the data is received when needed) as well as at the sender
side (i.e. by avoiding the buffering of out-of-date data).

• The retransmission of out-of-date data is avoided.

• Access schedules to ADUs can be managed by the transport layer instead of
the application layer.

• Ultimately, this approach leads to very simple applications which only have to
react to transport layer events and which offload the management of time and
synchronization constraints onto the transport service.

Weakly synchronous transport protocols are also useful in the framework of networks
that deliver an integrated or differentiated service. By supporting some admissible
temporal variability, these protocols offer some flexibility for network resource
management, and indirectly offer the user and the network provider the means to trade off
quality against price. A TPOC specification defines an envelope of services,
from the worst admissible to the best for which the user is ready to pay; these services
have to be dynamically adapted and mapped to the network layer differentiated or
integrated services. This approach allows the network service provider to satisfy its
clients while optimizing its resource usage.



5 GTP Implementation 

A first version of a Generic Transport Protocol based on the notion of TPOC has been
developed in the framework of the 5th IST European project GCAP (Global Communication
Architecture and Protocols for new QoS services over IPv6 networks).

GCAP aims at developing, for the future Internet, a new generation of end-to-end multicast
and multimedia transport protocols that provide a guaranteed QoS to advanced
Multimedia Multipeer applications on top of heterogeneous networks.

The Java language has been used for designing GTP because the Java environment
offers a multi-platform implementation that delivers the performance required for a
transport protocol. Indeed, our experiments have shown that, although C performs better than
Java, Java performance is sufficient for designing a transport protocol in
user space and for offering efficient support to multimedia applications [8]. In
its current implemented version, GTP offers generic support for the partial order and
reliability constraints required by the transport service user. Temporal services are not
yet offered but should be available in the next version.

GTP uses a pull approach, where the receiver side initiates the connection establishment
and termination. At the sender side, a server waits for a connection request from
the receiver. When the connection request arrives, a sender socket is instantiated and
connected to the receiver socket. The receiver sends an object request to the sender
side in order to get a multimedia component. This object request includes the identification
of the multimedia component and QoS parameters (i.e. a compact representation
of the TPOC specification of service). The sending and receiving entities therefore
share a common specification of the transport service, which allows them to apply efficient
adaptation mechanisms between application needs and network behaviour. The
GTP API is similar to the standard Java TCP API as defined by the Socket class of the
java.net package (Table 1).
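
The order of the calls in this pull model can be illustrated with a toy, single-threaded Java sketch. It is emphatically not the GTP implementation: the classes ToyReceiver and ToySender are hypothetical in-memory stand-ins, the two queues replace the network, requests and packets are plain strings, and only the method names mediaObjectRequest, acceptRequest, send and receive are borrowed from Table 1, with simplified signatures.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy illustration of the pull-style exchange; in-memory stand-ins, not the GTP classes. */
final class PullFlowSketch {

    /** Stand-in for the receiver side. */
    static final class ToyReceiver {
        private final Queue<String> toSender;
        private final Queue<String> fromSender;

        ToyReceiver(Queue<String> toSender, Queue<String> fromSender) {
            this.toSender = toSender;
            this.fromSender = fromSender;
        }

        /** Asks for a media component (the real request also carries QoS parameters). */
        void mediaObjectRequest(String componentId) {
            toSender.add(componentId);
        }

        /** Returns the next packet ready for the user, or null if none is available. */
        String receive() {
            return fromSender.poll();
        }
    }

    /** Stand-in for the sender side. */
    static final class ToySender {
        private final Queue<String> fromReceiver;
        private final Queue<String> toReceiver;

        ToySender(Queue<String> fromReceiver, Queue<String> toReceiver) {
            this.fromReceiver = fromReceiver;
            this.toReceiver = toReceiver;
        }

        /** Returns the pending request, or null if the receiver has not asked for anything. */
        String acceptRequest() {
            return fromReceiver.poll();
        }

        void send(String packet) {
            toReceiver.add(packet);
        }
    }

    /** Runs one request/response exchange and returns the packets seen by the receiver. */
    static List<String> run(String componentId, List<String> packets) {
        Queue<String> requests = new ArrayDeque<>();
        Queue<String> data = new ArrayDeque<>();
        ToyReceiver receiver = new ToyReceiver(requests, data);
        ToySender sender = new ToySender(requests, data);

        receiver.mediaObjectRequest(componentId);   // 1. the receiver pulls
        String request = sender.acceptRequest();    // 2. the sender obtains the request
        if (request != null) {
            for (String packet : packets) {
                sender.send(request + ":" + packet); // 3. the sender streams the component
            }
        }
        List<String> received = new ArrayList<>();
        for (String p = receiver.receive(); p != null; p = receiver.receive()) {
            received.add(p);                         // 4. the receiver reads what is ready
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(run("clip", List.of("V1", "A", "V2")));
    }
}
```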

Table 1. The GTP API

Class GTPServerSocket
    This server waits for a connection request from a receiver side.

    Constructor GTPServerSocket(localAddress, localPort, maxConn)
        This constructor creates a socket server using a local address and a local
        port, and specifies the maximum number of connections.

    GTPSenderSocket accept()
        This method waits for a connection and accepts it, instantiating a
        GTPSenderSocket connected to the receiver side.

Class GTPSenderSocket
    The sender socket is instantiated with the specification of the transport
    service sent by the peer entity.

    Request acceptRequest()
        The acceptRequest method waits for a request from the receiver side.

    void ackRequest(GTP.Request request)
        The ackRequest method acknowledges the request.

    void send(GTP.GTPPacket fp)
        This method allows one to send GTP packets to the receiver side.

Class GTPReceiverSocket
    The receiver socket is able to send the object requests. The "receive" method
    allows one to read the GTP packets ready to be delivered to the user.

    Constructor GTPReceiverSocket(remoteAddress, remotePort, localAddress, localPort)
        This constructor creates a GTPReceiverSocket using the specified local and
        remote addresses and ports.

    void closeRequest()
        This method allows one to request the termination of the connection.

    ObjectRequest mediaObjectRequest(GTPMediaObject pmo)
        This method instantiates the transport connection with a specification of
        service and sends a request for a media object to the server side.

    GTPPacket receive()
        The receive method allows one to get a GTP packet.



5.1 First Experiment

This first experiment aims to evaluate the gain obtained by using a partial order protocol
on top of a non-reliable network environment. The experiment tests
the contribution of the GTP protocol to the transfer of simple JPEG still images between
an image server and a remote image player. The client and the server of the application
considered are able to receive and send, respectively, independent segments composed of
groups of macro-blocks (i.e. these segments can be decoded and displayed in any order).
For comparison purposes, this client-server application has been tested successively on
top of UDP, GTP, and a fully reliable and ordered instantiation of GTP, hereinafter
referred to as TCP*, which aims to simulate the TCP behaviour without its congestion
control mechanisms (in its current version GTP does not apply any congestion control
technique). Figure 4 illustrates the dummynet-based emulation platform used for the
experiment. In this experiment, the sender and receiver sides are two Windows 2000
systems located in two different subnets. A third computer running the FreeBSD system
provides the routing services between the two subnets and supports the network
layer emulation environment. Dummynet emulation capabilities have been used because
they make it very easy to tune the main network QoS parameters such as bandwidth,
losses and delays.




Fig. 4. The emulation platform 

Figure 5 illustrates the end-to-end transmission delay required for a
218 Kbytes image. UDP (i.e. no ordering and no loss control), GTP with full reliability and no order,
and TCP* (i.e. full reliability and order) are respectively used on top of a
network service that induces 0, 5, 10, 15 and 20 percent of losses. The results show
that GTP, though instantiated to deliver a fully reliable service, requires almost
the same duration as UDP to transmit the data. In contrast, from 10% of losses onward,
TCP* exhibits a dramatic increase in its transmission duration.

[Figure: total transmission time in a non-reliable network environment (image of 218 Kbytes), plotted against losses (%)]

Fig. 5. Comparison between UDP, TCP* and GTP end to end transfer delay for a JPEG image



5.2 Second Experiment 

The main goal of the second experiment is to compare the partially reliable
service offered by GTP with the fully reliable service of TCP* (once again, TCP is
emulated by a fully reliable and ordered GTP connection). The ability to integrate
GTP with existing multimedia applications is another feature evaluated in
this experiment. For this purpose, we have used the Java Media Framework player
(JMF Player) for its ability to dynamically integrate new transport services and
protocols on top of UDP. The same platform as the one described for experiment
1 has been used as test bed and for network emulation. The media object
to transmit is a 1.6 Mbytes MJPEG video stream composed of 411 frames.
The partial reliability service of GTP has been instantiated to tolerate 30% of
losses. It is important to understand that, in this elementary experiment, because
video frames are treated as monolithic transport segments (TPDUs) that are fragmented
by the network protocol, 5% of network losses entails around 35% of losses
for the transport layer segments. Note that, to avoid such fragmentation, each
video frame could be split into independent ADUs as in experiment 1. Such an
approach has been described in [9] and [10], where we have proposed a technique that
allows MPEG video frames to be segmented into independent ADUs that can benefit
not only from a partially reliable transport service but also from a partially ordered
one.
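
The amplification of network losses by fragmentation can be checked with a back-of-the-envelope model. Assuming, as a simplification not stated in the text, that fragments are lost independently with probability p and that a segment is lost as soon as one of its n fragments is lost, the segment loss probability is 1 − (1 − p)^n. The number of fragments per frame is not given in the text; n = 8 is an illustrative assumption, for which p = 0.05 gives 1 − 0.95^8 ≈ 0.34, the same order of magnitude as the reported 35%. The class and method names below are hypothetical.

```java
/** Illustrative loss model for fragmented segments (hypothetical names). */
final class FragmentationLoss {

    /**
     * Probability that a segment split into 'fragments' packets is lost, when each
     * packet is lost independently with probability 'fragmentLoss'.
     */
    static double segmentLossProbability(double fragmentLoss, int fragments) {
        return 1.0 - Math.pow(1.0 - fragmentLoss, fragments);
    }

    public static void main(String[] args) {
        // 5% packet loss; the fragment counts are assumptions made for illustration.
        for (int n : new int[] {1, 4, 8}) {
            System.out.println(n + " fragment(s): " + segmentLossProbability(0.05, n));
        }
    }
}
```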

In the presence of 5% of network losses and with an admissible partial reliability of
30%, GTP is able to deliver 70% of the video
frames to the transport service user within 39 seconds. In contrast, a fully reliable service
requires 57 seconds to transmit the full video stream and entails long blocking periods at
the application layer that are incompatible with the continuity constraints that must be
satisfied to display a video stream correctly. Indeed, in this case, the use of a
fully ordered transport service results in 87% of video frames missing their presentation
deadlines, compared to 47% with the partially reliable one considered in
the experiment.



[Figure: GTP vs TCP* performance in video streaming (5% of network losses), plotted against time (secs)]

Fig. 6. Comparison between a partially reliable GTP instantiation and a fully reliable one
(denoted TCP*) for the transport of MPEG video in presence of 5% of network losses.

More generally, a partially reliable service offers a trade-off between the
controlled percentage of losses (uncontrolled with an unreliable service such as UDP)
and the transmission delay (systematically longer with a reliable transport service such
as TCP).



6 Conclusion 

On top of high-performance networks there is a clear need for new transport services
adapted both to the new types of services delivered by these networks and
to the QoS needs of the distributed multimedia applications which are pervading the
Internet. We have introduced in this paper a new family of transport services and
protocols which satisfy this double requirement. The contribution of these advanced
transport services has been successfully tested for videoconference and video
on demand applications [9] and is one of the major building blocks of the GCAP
project, currently being designed within the 5th IST European Program. This new generic
transport protocol delivers a service and can exploit rate control, congestion control and
error control mechanisms which are directly derived from a formal specification of the
application QoS needs. When used on top of best effort networks, such an approach
dramatically enhances the QoS perceived by the user. This approach is also very promising when
used on top of differentiated or integrated services because it offers flexibility for
the management of network and end-system resources; hence GTP allows the user as
well as the network provider to trade off between performance and cost.

This new field of timed partially reliable and ordered transport protocols raises several
open issues. In particular, the compatibility of the congestion control
mechanisms currently under study with TCP connections and TCP receiver or sender entities
is a critical research topic which must be addressed to ensure the successful dissemination
of this new family of protocols. Moreover, extending the point-to-point approach introduced
in this paper to deliver a multicast, timed, partially reliable and
ordered transport service on top of heterogeneous networks is currently being investigated
in the framework of the GCAP European project.

Abstract. For time-constrained applications, repair-server-based active local recovery
approaches can be valuable in providing low-latency reliable multicast service.
However, an active multicast repair service consumes resources at the repair
servers in the multicast tree. A scheme was thus presented in earlier work to dynamically
activate/deactivate repair servers with the goal of using as few system resources 
(repair servers) as possible, while at the same time improving application-level 
performance. In this paper, we develop stochastic models to study the distribution 
of repair delay both with and without a repair server in a simple multicast tree. 
From these models, we observe that the application deadline, downstream link 
loss rates, the number of receivers, and the upstream round trip time of a repair 
server all influence the overall value of activating an active repair server. Based 
on these observations, we propose a modified dynamic repair server activation 
algorithm that considers the packet loss rate, the number of downstream receivers, 
and the round trip time to the nearest upstream active repair server when activat- 
ing/deactivating a repair server. From simulation, we observe that our modified 
dynamic repair server activation algorithm provides a significant reduction in the 
latency of successful packet delivery (over the original algorithm) while using the 
same amount of system resources. We also find that much of the performance 
gains achievable by having active repair servers can be obtained by having only a 
relatively small fraction of repair servers actually being active. 



1 Introduction 

Delay-sensitive applications such as teleconferencing, distributed simulation, multi- 
player games, and Internet telephony all have timing constraints on the successful de- 
livery of data from source to destination(s). In such applications, data that do not arrive 
at receivers prior to some application-determined deadline, are considered lost and can 
result in impairments in application-level performance. For a real-time video stream, for 
example, a missed deadline can result in playout jitter or breaks in the playout stream. 
For a conversation with interaction among multiple speakers, the delay from when a user 
speaks or moves until the action is manifested at the receiving hosts should be less than 
a few hundred milliseconds [5].

* This work is supported by the Defense Advanced Research Projects Agency (DARPA) under
contract N6600 1 -9 1 1 7-4 1 1 V

Many reliable multicast protocols exploit local recovery to
reduce the delay in successful data delivery, making them attractive for supporting
delay-sensitive applications. Similarly, by using active repair servers (RS) in a multicast tree to
provide retransmission (i.e., error recovery) service, repair-server-based local recovery
schemes can successfully reduce the repair latency, as well as suppress NAK implosion,
and provide retransmission scoping. On the other hand, active repair servers inside the
network require additional resources (e.g., buffering and processing). Thus it is desirable
to activate as few repair servers as possible, while at the same time providing enough
repair servers to improve repair latency.

In this paper, we study the tradeoff between repair delay (equivalently, the time
needed to successfully deliver data to the receiver(s)) and system resource consumption
by focusing on the benefit gained by using server-based active error recovery (AER).
We begin by developing stochastic models to study the distribution of repair delay both 
in the presence, and in the absence, of repair servers. Based on these models we observe 
that the application deadline, downstream link loss rates, the number of receivers, and 
the upstream round trip time of a repair server are all important criteria influencing the 
decision of whether to activate/deactivate an RS. 

We then consider specific algorithms for dynamic RS activation/deactivation. We
begin with an earlier algorithm, which introduced a protocol to dynamically
activate/deactivate repair servers on the basis of the packet loss rate within the RS’s subtree.
For delay-sensitive applications, it is also appropriate to consider time-based measures
such as the upstream RTT in deciding whether or not to activate an RS. We thus modify
that algorithm to consider not only the packet loss rate and number of downstream
receivers, but also the round trip time to the nearest upstream active repair server (or the
sender) when making the activation/deactivation decision. We study via simulation the
tradeoff that exists between the fraction of RSs that are activated and the repair delay. We
show that our modified dynamic RS activation algorithm provides a significant reduction
in repair delay over the original algorithm, while using no more system resources
than the original algorithm. We also find that much of the performance gain achievable
by having active repair servers can be obtained by having only a relatively small fraction
of repair servers actually being active.

The remainder of this paper is organized as follows. In Section 2, we describe a 
multicast loss recovery architecture and a reliable multicast protocol that uses active 
repair services. Section 3 presents the analytic model that we use to evaluate the decrease 
in repair delay when using an active repair server for time-constrained applications 
in a simple multicast tree. Section 4 proposes a modified algorithm for dynamically 
activating/deactivating repair servers, and presents simulation results comparing the 
performance of the modified algorithm with that of the original algorithm. Section 5 
concludes this paper. 



2 Real-Time Reliable Active Multicast: Motivation and Algorithms 

Several recent efforts have focused on providing better-than-best-effort service for
delay-sensitive applications. Maxemchuk et al., Lucas et al. and other researchers have
proposed various distributed receiver-based local recovery schemes. The Active Error
Recovery (AER) protocol uses a repair-server-based local recovery scheme, in
which a repair server (RS) is attached to a router to perform active local error recovery
(i.e., buffering and retransmission) for downstream receivers and RSs. AER was shown
to achieve low latency error recovery, NAK suppression, retransmission scoping, and
superior bandwidth utilization. However, the use of additional active nodes inside the
network comes at a price - additional system resources, such as buffers and computation,
are needed at the active nodes. Several questions immediately arise - how many active
RSs are needed and where should they best be placed; what is the tradeoff between the
number of activated RSs and the repair latency; must a repair server always be active, or
can a repair server dynamically monitor performance and then self-activate/deactivate
according to observed performance. These are some of the questions we address in this
paper.

Rubenstein et al. [?] have proposed a static, centralized, tree-based protocol to construct
a repair graph that uses RSs to retransmit lost packets to receivers before their
deadline. However, when the multicast tree structure and/or link loss rates change over
time, such a static approach may not be appropriate. In [?], Osland et al. presented
an algorithm to dynamically activate/deactivate RSs in response to changing network
conditions. However, their approach uses only the packet loss rate to trigger RS
activation/deactivation. We will see shortly that for delay-sensitive applications, delay-based
criteria can be effectively used in making dynamic activation/deactivation decisions at
an RS.

Let us now describe a reliable multicast protocol known as AER (Active Error
Recovery [?]) that implements active server-based repair services; we will
subsequently modify this protocol to implement dynamic RS activation/deactivation. In
AER, a subcast is defined to be a multicast transmission from an RS to the downstream
multicast subtree rooted at the RS. We describe the algorithm in the context of a sender
that periodically multicasts data to a multicast address that is subscribed to by all
receivers and participating repair servers; other scenarios are also possible. The algorithm
operates as follows:

- Packet forwarding, buffering. When a new packet arrives at a router associated 
with an RS, it is multicast downstream by the router and stored in the buffer of the 
repair server. 

- NAK suppression. On detecting a loss, a receiver (or repair server), after waiting 
for a random period of time (the NAK suppression time), sends out a NAK to its 
nearest upstream active server (a repair server or the sender). At the same time, it 
starts a NAK retransmission timer. If the receiver (or repair server) receives a NAK 
for the same packet from its upstream repair server prior to sending its own NAK, 
it suppresses its own NAK transmission. 

- NAK timeout. The expiration of the NAK retransmission timer at a repair server (or a 
receiver), without prior reception of the corresponding repair, serves as the detection 
of a lost packet for the repair server (or the receiver) and a NAK is retransmitted. 

- Repair packet transmission, NAK aggregation and propagation. When a repair
server receives a NAK from a downstream node, it subcasts the packet if it has
the packet in its buffer. Otherwise, if there is already a pending NAK for that lost
packet, the new incoming NAK is suppressed. If there is neither a pending NAK
for this packet nor a buffered repair packet, the RS immediately subcasts the NAK
downstream, sends a NAK upstream after waiting for a random period of time,
and keeps a pending NAK until the repair packet arrives.

- Repair packet retransmission. On receiving a NAK from a downstream repair 
server, the sender re-subcasts the requested packet to all the receivers and repair 
servers. As mentioned above, these repairs are received by intermediate repair 
servers and forwarded down the tree only if there is a pending NAK for that re- 
pair. 

- Repair packet reception. If a repair packet is received, the repair server first checks 
whether there is a pending NAK for that packet. If so, that repair packet will be 
buffered and subcast downstream. If there is no pending NAK for this repair packet, 
the RS discards this packet. 

The AER protocol successfully reduces NAK implosion by using randomized timer- 
based NAK suppression and by using the repair server hierarchy for NAK aggregation. 
We next present a simple model of the delay performance of this protocol in a simple 
multicast tree. 
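
The repair-server side of the steps above can be sketched in Python. This is a minimal illustration, not the AER implementation: the class name `RepairServer`, its method names, and the action labels written to `log` are hypothetical, and all timers (random NAK back-off, NAK retransmission timeout) are omitted so that only the buffering and pending-NAK logic remains.

```python
class RepairServer:
    """Toy model of an active repair server (hypothetical names, timers omitted)."""

    def __init__(self):
        self.buffer = {}            # seq -> payload held for local repair
        self.pending_naks = set()   # seqs already requested from upstream
        self.log = []               # actions taken, in order

    def on_data(self, seq, payload):
        # Packet forwarding and buffering: forward downstream, keep a copy.
        self.buffer[seq] = payload
        self.log.append(("forward", seq))

    def on_nak(self, seq):
        # NAK received from a downstream receiver or repair server.
        if seq in self.buffer:
            self.log.append(("subcast_repair", seq))    # local repair
        elif seq in self.pending_naks:
            self.log.append(("suppress_nak", seq))      # NAK aggregation
        else:
            self.pending_naks.add(seq)
            self.log.append(("subcast_nak", seq))       # quiets other downstream nodes
            self.log.append(("nak_upstream", seq))      # sent after a random delay in AER

    def on_repair(self, seq, payload):
        # Repair from upstream: used only if a downstream node asked for it.
        if seq in self.pending_naks:
            self.pending_naks.discard(seq)
            self.buffer[seq] = payload
            self.log.append(("subcast_repair", seq))
        else:
            self.log.append(("discard_repair", seq))
```

Feeding the object a data packet, NAKs for a buffered and an unbuffered packet, and then repairs shows each branch of the protocol description in the recorded `log`.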

3 A Simple Model for Understanding the Benefits of Using a Repair Server in Time-Constrained Applications

In this section we develop an analytic model to study the distribution of repair delay 
both with and without a repair server for time-constrained applications. Our goal is to 
gain insight into the effect of parameters such as application deadline, downstream link 
loss rates, the number of receivers, and the upstream round trip time of a repair server. 

Figure 1 shows a generic model for a single RS co-located at a router. Node A is
the sender. Node B represents an intermediate router, to which a repair server (denoted
by a square node in the figure) can be attached. On receiving a packet from
sender A, router B multicasts that packet on its subtree. $R = \{R_1, R_2, \ldots, R_N\}$ denotes
the set of receivers on the tree. The solid line between node A and node B represents the
transmission path between the sender and the router (or its corresponding repair server).
Similarly, the lines between router B and receivers $R_1$ to $R_N$ represent the paths between
the router and the receivers.

The notation we will use is as follows:

$X_i$: Time required to successfully deliver a packet from the sender to receiver $R_i$.

$X_{ij}$: Time to transmit a packet from $i$ to $j$ in the absence of packet loss, where
$i \in \{A, B\}$ and $j \in \{B\} \cup R$. We model $X_{ij}$ as a fixed value.

$X_{ji}$: Time to transmit a NAK from $j$ to $i$, where $j \in \{B\} \cup R$ and
$i \in \{A, B\}$. We model $X_{ji}$ as a fixed value.

$\tau_i$: A joint timer of $i$ that integrates the NAK suppression timer, which is a
function of the RTT between $i$ and its nearest upstream active node (the
sender or the active RS), and the NAK retransmission timer of $i$, which
is a function of the RTT from $i$ to the sender. Here $i \in \{B\} \cup R$.

$D$: The application-dependent deadline.

$p_{ij}$: Loss probability of path $ij$, where $i \in \{A, B\}$ and $j \in \{B\} \cup R$.

$\lambda$: Recall that we are modeling a constant-rate sender. Data packets arrive
at A for first-time transmission at constant rate $\lambda$.

Fig. 1. A Simple Multicast Topology

We denote the probability that a packet is delivered after the application deadline at
receiver $R_i$ as $P\{X_i > D\}$.

Based on this single-RS configuration, we now describe two simple models that
characterize the distribution of repair delay both with and without RSs.



3.1 Evaluation of $\sum_{i \in R} P\{X_i > D\}$ when an RS Is Used

Let us first consider the transmission delay, $X_i$, in the presence of a repair server (Figure 1).
We introduce the following random variables. Let $K$ denote the number of losses
prior to the first successful transmission of a packet on path $AB$. Let $K_i$ denote the
number of losses prior to the first successful transmission of a packet on path $BR_i$. We
are interested in the value of $X_i$ conditioned on $K = k$ and $K_i = k_i$. We denote this as
$X_i(k, k_i)$ and compute its value as follows:

$$
X_i(k, k_i) =
\begin{cases}
X_{AB} + X_{Bi}, & k = 0,\ k_i = 0 \\
X_{AB} + \Delta_i + (k_i - 1)\tau_i + X_{iB} + X_{Bi}, & k = 0,\ k_i \ge 1 \\
\Delta_{RS} + (k - 1)\tau_B + X_{BA} + X_{AB} + X_{Bi}, & k \ge 1,\ k_i = 0 \\
\Delta_{RS} + (k - 1)\tau_B + X_{BA} + X_{AB} + \Delta + (k_i - 1)\tau_i + X_{iB} + X_{Bi}, & k \ge 1,\ k_i \ge 1
\end{cases}
\tag{1}
$$



A timeline of $X_i$, when $k \ge 1$ and $k_i \ge 1$, is shown in Fig. 2. The loss detection duration,
$\Delta_{RS}$, is a function of the constant transmission rate $\lambda$ and the delay of path $AB$. At the
repair server, the time until the next successful receipt of a packet follows a geometric
distribution with mean $\frac{1}{1 - p_{AB}} \cdot \frac{1}{\lambda}$. Thus we model
$\Delta_{RS} = X_{AB} + \frac{1}{1 - p_{AB}} \cdot \frac{1}{\lambda} + T_B$,
where $T_B$ is the NAK suppression timer of repair server B. Similarly we model
$\Delta_i = X_{Bi} + \frac{1}{1 - p_{Bi}} \cdot \frac{1}{\lambda} + T_i$,
where $T_i$ is the NAK suppression timer of receiver $R_i$. Here, $\Delta$
denotes the time between the arrival of a retransmitted packet at the repair server and
the receiver's transmission of the next NAK. From Figure 2 we observe that $\Delta$ follows
a uniform distribution in $[0, \tau_i]$.






P. Ji, J. Kurose, and D. Towsley 




[Figure 2 timeline segments, in order: time before the RS sends out the first NAK; retransmission timer of RS B; $X_{BA} + X_{AB}$; retransmission timer of receiver $R_i$; $X_{iB} + X_{Bi}$]

Fig. 2. Timeline of $X_i(k, k_i)$. Repair Service is Used



For a given $k$ and a given $k_i$, from equation (1), we can easily evaluate the value of
$X_i(k, k_i)$. Consequently, we can determine whether $X_i(k, k_i)$ is greater than
the application deadline $D$. The probability $P\{X_i > D\}$ can be evaluated by removing
the conditioning on the values of $k$ and $k_i$:

$$
\sum_{i \in R} P\{X_i > D\} = \sum_{k=0}^{\infty} \sum_{i \in R} \sum_{k_i=0}^{\infty} \mathbf{1}\big(X_i(k, k_i) > D\big)\, P\{K = k\}\, P\{K_i = k_i\}
\tag{2}
$$

where $P\{K_i = k_i\} = p_{Bi}^{k_i} (1 - p_{Bi})$, $P\{K = k\} = p_{AB}^{k} (1 - p_{AB})$, and $\mathbf{1}(P)$ is
one when the predicate $P$ is true and zero otherwise.
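
To make the conditioning argument concrete, the following Python sketch evaluates the double sum of equation (2) for a single receiver by truncating both geometric series. It is a sketch under stated assumptions: the function and key names are hypothetical, the timers and loss detection durations in `params` are illustrative values that the text does not give, the NAK transmission times are taken equal to the data transmission times, and the residual wait $\Delta$ is fixed at a constant rather than drawn from its uniform distribution.

```python
def delay_with_rs(k, ki, p):
    """X_i(k, k_i) following the four cases of equation (1); p holds model constants."""
    if k == 0:
        upstream = p["X_AB"]
    else:
        upstream = p["Delta_RS"] + (k - 1) * p["tau_B"] + p["X_BA"] + p["X_AB"]
    if ki == 0:
        return upstream + p["X_Bi"]
    # First local loss: Delta_i when the RS received the packet at once,
    # otherwise the residual wait Delta (a constant in this sketch).
    detect = p["Delta_i"] if k == 0 else p["Delta"]
    return upstream + detect + (ki - 1) * p["tau_i"] + p["X_iB"] + p["X_Bi"]


def prob_late_with_rs(deadline, p, kmax=60):
    """Truncated double sum of equation (2) for one receiver."""
    total = 0.0
    for k in range(kmax):
        pk = p["p_AB"] ** k * (1 - p["p_AB"])
        for ki in range(kmax):
            pki = p["p_Bi"] ** ki * (1 - p["p_Bi"])
            if delay_with_rs(k, ki, p) > deadline:
                total += pk * pki
    return total


params = {
    # One-way delays (ms) and loss rates of the numerical case study.
    "X_AB": 30.0, "X_Bi": 10.0, "p_AB": 0.05, "p_Bi": 0.10,
    # Assumed: NAKs take as long as data packets on the same path.
    "X_BA": 30.0, "X_iB": 10.0,
    # Assumed timer and loss detection values (ms), for illustration only.
    "tau_B": 60.0, "tau_i": 20.0,
    "Delta_RS": 45.0, "Delta_i": 25.0, "Delta": 10.0,
}
```

With these values a packet that is never lost arrives after 40 ms, so `prob_late_with_rs(40.0, params)` equals the probability of at least one loss on either path.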



3.2 Evaluation of $\sum_{i \in R} P\{X_i > D\}$ when an RS Is Not Used

The analysis is more complicated when there is no repair server attached at router B
of Figure 1, because the end-to-end link losses are correlated. Thus, the probability that
a packet is delivered after the application deadline cannot be solved independently at
each receiver. In this case, we can use the following alternative approach to evaluate
$\sum_{i \in R} P\{X_i > D\}$. Again, we focus on a randomly selected packet.

$$
\begin{aligned}
\sum_{i \in R} P\{X_i > D\} &= E[\text{number of receivers that receive the packet after the deadline}] \\
&= N - E[\text{number of receivers that receive the packet by the deadline}]
\end{aligned}
\tag{3}
$$

where $N$ denotes the number of receivers. To evaluate the expected number of receivers
that successfully receive the data packet before its deadline, we must define some
additional notation.

Let $T^{(l)}$ be the number of receivers that successfully receive that data packet before
or at the $l$th retransmission of that packet. Thus, from formula (3), we can derive

$$
\sum_{i \in R} P\{X_i > D\} = N - E[T^{(d)}]
\tag{4}
$$

where $d$ is the maximum number of retransmissions allowed to meet the deadline. Here
$d$ is a function of the application-dependent deadline, $D$, and link transmission delays.
Yamamoto et al. [?] provide a general model for evaluating the delay performance
of receiver-initiated reliable multicast protocols. In our study, we assume that a
NAK-receipt suppression timer is used by the sender. Within a NAK-receipt timeout interval,
the sender considers multiple NAKs it receives for a given packet as being redundant
and only retransmits (multicasts) the packet once. This timer is a function of the longest
RTT to receivers. We simplify our problem to the case that all receivers have the same
RTT to the sender. Therefore, at receiver $R_i$, during each timer $\tau_i$, there can be at most
one retransmission for a specific lost packet. Thus $d$ is determined by $D$, $\tau_i$, and $\Delta_i$,
where $\Delta_i$ denotes the loss detection duration of receiver $R_i$ and is a function of $p_{Ai}$,
$X_{Ai}$ and $\lambda$. Consequently, from equation (4), we have

$$
\sum_{i \in R} P\{X_i > D\} = N - \sum_{i=1}^{N} i \cdot P\{T^{(d)} = i\}
\tag{5}
$$

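Given the application deadline and the corresponding maximum number of retransmissions
$d$, we can calculate the probability $P\{T^{(d)} = i\}$ as follows:

$$
P\{T^{(l)} = i\} = \sum_{j=0}^{i} P\{T^{(l)} = i \mid T^{(l-1)} = j\} \cdot P\{T^{(l-1)} = j\}, \qquad l = 0, 1, \ldots, d
\tag{6}
$$

where the conditional probability $P\{T^{(l)} = i \mid T^{(l-1)} = j\}$ is shown in equation (7).

$$
P\{T^{(l)} = i \mid T^{(l-1)} = j\} =
\begin{cases}
p_{AB} + (1 - p_{AB})\, p_{Bi}^{N-j}, & i = j; \\
(1 - p_{AB}) \binom{N-j}{i-j} (1 - p_{Bi})^{i-j}\, p_{Bi}^{N-i}, & N \ge i > j;
\end{cases}
\tag{7}
$$

with the initial condition $P\{T^{(-1)} = 0\} = 1$. Here $p_{Bi}$ is taken to be the same for all downstream paths.

Thus $P\{T^{(d)} = i\}$ can be recursively computed and the corresponding
$\sum_{i \in R} P\{X_i > D\}$ can be evaluated.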
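
The recursion of equations (6) and (7) is straightforward to carry out numerically. The Python sketch below is illustrative only: the function names are hypothetical, all downstream paths are assumed to share one loss rate `p_b`, and iteration `l = 0` is taken to be the original transmission, which is an interpretation of the initial condition rather than something the text states.

```python
from math import comb


def receivers_reached_dist(n, d, p_ab, p_b):
    """Distribution of T^(d): receivers holding the packet after the original
    transmission and d retransmissions (identical downstream loss rate p_b)."""
    dist = [1.0] + [0.0] * n              # T^(-1) = 0 with probability 1
    for _ in range(d + 1):                # l = 0 (original transmission) ... d
        nxt = [0.0] * (n + 1)
        for j, pj in enumerate(dist):
            if pj == 0.0:
                continue
            rem = n - j                   # receivers still missing the packet
            nxt[j] += pj * p_ab           # lost upstream: nobody new receives it
            for m in range(rem + 1):      # m of the remaining receivers succeed
                nxt[j + m] += (pj * (1 - p_ab) * comb(rem, m)
                               * (1 - p_b) ** m * p_b ** (rem - m))
        dist = nxt
    return dist


def expected_late(n, d, p_ab, p_b):
    """N minus the expected number of receivers reached, as in equation (5)."""
    dist = receivers_reached_dist(n, d, p_ab, p_b)
    return n - sum(i * q for i, q in enumerate(dist))
```

For five receivers with `p_ab = 0.05` and `p_b = 0.10`, `expected_late(5, d, 0.05, 0.10)` gives the expected number of late receivers for a chosen `d`. Because expectation is linear, the output of this sketch can be cross-checked against `n * (1 - (1 - p_ab) * (1 - p_b)) ** (d + 1)`.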

3.3 Numerical Case Study 

Using the analysis techniques described above, we can evaluate the cumulative distribution
function (CDF) of $X_i$, the time needed to successfully deliver a packet to a receiver
in the presence/absence of an RS, for various application-level deadlines $D$. A numerical
case study for the CDF of $X_i$ is shown in Figure 3. Here, we assume that the number of
receivers is 5 and that path $AB$ experiences a loss rate of $p_{AB} = 0.05$. For each downstream
path, we assume a loss rate of $p_{Bi} = 0.10$. The one-way delay for path $AB$ is 30 ms, and
the one-way delay between router B and each receiver $R_i$ is 10 ms. Figure 3 shows that
when a repair server is active, there is a higher probability that a packet is successfully
delivered within a given deadline than when the repair server is not active. For
this set of parameters, we see the largest gain when the application deadline $D$ is between
80 ms and 190 ms.

Fig. 3. Comparison of the CDF of Transmission Delay for the With/Without RS Cases

As the values of the model parameters change, so too does the distribution of $X_i$.
Due to space considerations, we only briefly explore the parameter space to identify
general trends; in [?] we provide a more complete study. In [?] we define a general cost
function that weights the benefits of decreased repair latency and the cost of resources
(e.g., buffering and computation) required by active repair servers. The cost difference
between the case of having active repair servers and not having active repair servers is
denoted by $\Delta C$, and is defined as:

$$
\Delta C = c_a \sum_{i \in R} P^{\text{no RS}}\{X_i > D\} - \Big( c_a \sum_{i \in R} P^{\text{RS}}\{X_i > D\} + c_b \Big)
$$

where the first sum is evaluated without a repair server and the second with one, and
$c_a$ and $c_b$ are weights reflecting the unit performance degradation penalty cost for
an overly delayed packet and the resource cost for recovering a packet, respectively.

We illustrate the effect of an application's deadline on the cost difference $\Delta C$ in
Figure 4. We considered four downstream loss rates and observed that as the downstream
link loss probability $p_{Bi}$ increases, so too does the relative benefit of having active
repair servers. For the parameters considered in Figure 4, the benefit is largest when
the application deadline lies in the range of 2 to 4.6 times the one-way delay (about
80 ms to 190 ms). As the application deadline increases beyond this range, the relative
performance benefits decrease and actually become negative when $D > 380$ ms. This is
because as the application deadline increases, the probability of successfully recovering
a packet within the deadline increases (approaching 1) both with and without repair
servers, while the with-repair-server case incurs a resource cost, $c_b$. In [?] we also
consider the effects of downstream loss rates ($p_{Bi}$), upstream round trip time ($X_{AB}$), and
number of receivers ($N$) on the cost difference $\Delta C$.

Fig. 4. The Effect of Application Deadline on the Saving of Transmission Delay by Setting an RS



4 An Algorithm for Dynamic Activation/Deactivation 
of Repair Servers 



In [?], Osland et al. modified the basic AER protocol to include a two-threshold
algorithm for dynamically activating and deactivating repair servers. In their approach, an
RS estimates the loss rate to receivers in its subtree over intervals of time. At the end
of each interval, if the measured loss rate is greater than an upper threshold value, the
repair server is activated; if the measured loss rate is less than a lower threshold value,
the corresponding repair server is deactivated.

Our goal is to develop a new algorithm that dynamically activates/deactivates an RS
for the purposes of supporting time-constrained applications. In Section 3, we studied
how system parameters such as the upstream RTT, number of receivers, application
deadline, and link loss rate determine the benefits possible with the use of active RSs.
Given that our primary application-level performance metric of interest (the probability
that a packet is successfully delivered within its deadline) is time-related, it is natural
to include time-based considerations in the activation/deactivation decision.

Informally, our time-sensitive RS activation/deactivation algorithm operates as follows.
As in [?], an RS estimates (observes) the loss rate to receivers in its subtree over
intervals of time of length $\tau$. We introduce the variable $N_{loss}$ to denote the number of lost
packets during a time interval, and $N_{pkt}$ to denote the number of packets sent by the
sender during the same time interval. $N_{loss}$ and $N_{pkt}$ are measured at the RS. Let $p^k_{loss}$
denote the loss probability of an RS's subtree during time interval $k$. The formula for
computing $p^k_{loss}$ is:

$$
p^k_{loss} = \frac{N_{loss}}{N_{pkt}}
\tag{8}
$$

Rather than use the measurement of lost packets to define $p^k_{loss}$ as above from [?], we can
derive the following expression for $p^k_{loss}$:

$$
p^k_{loss} = 1 - \prod_{i=1}^{n} (1 - p_i)
\tag{9}
$$

where $n$ is the number of receivers suffering loss and $p_i$ is the link loss rate between the
RS and receiver $R_i$. Our earlier analysis showed that the number of receivers and the
individual link loss rates were important factors in determining performance. We note
that both of these factors appear explicitly in equation (9).
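
As a small worked illustration of equation (9), the Python helper below computes the subtree loss probability from per-link loss rates. The function name is hypothetical, and the product form presumes that losses on the downstream links are independent.

```python
from math import prod


def subtree_loss_probability(link_loss_rates):
    """Probability that at least one downstream link drops a packet,
    assuming independent losses on the RS-to-receiver links."""
    return 1.0 - prod(1.0 - p for p in link_loss_rates)
```

With five receivers, each behind a link with loss rate 0.10, the value is 1 - 0.9^5, or about 0.41, compared with 0.10 for a single receiver; this shows how the number of receivers enters the metric alongside the individual link loss rates.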

In addition to using the number of receivers and the individual link loss rates in the
activation/deactivation decision, we will want to include time-based measures as well.
In our analysis in Section 3, we saw that the upstream RTT (the round trip time from an
RS to the closest upstream RS or the sender itself; $X_{AB}$ in Figure 1) was an important
factor affecting the delay distribution. Thus, we would like to incorporate the upstream
RTT into our dynamic activation/deactivation algorithm. It is worth mentioning here that
a crucial parameter affecting performance is the application deadline $D$. However, $D$
is completely application-dependent and can easily be determined only at the receivers.
Thus, we choose not to include $D$ in our dynamic repair server activation/deactivation
algorithm.

4.1 Algorithm Description 

We now modify the RS activation/deactivation algorithm presented in [?] to account
for the upstream RTT of an RS together with $p^k_{loss}$. Intuitively, if an RS is very close to its upstream
RS (or the sender), activating the downstream RS will not significantly reduce the repair
delay. On the other hand, if the RTT between an RS and its upstream active repair
node is large, activating this RS can result in local repair service to downstream nodes
(i.e., repairs can be supplied by the RS itself), providing the possibility for a significant
reduction in repair delays. This intuition tells us that the larger the upstream RTT of an
RS, the more important it is to activate that RS. Therefore, we use the product of the
upstream RTT of an RS and its packet loss rate during time interval $k$ as a metric to
control the activation/deactivation decision at the end of a time interval.

In the original dynamic RS activation/deactivation algorithm [?] (which we will
refer to as AL1), during a time interval (of length $\tau = 4$ sec), the number of lost packets
and the number of packets sent by the sender are measured by an RS. The packet loss
probability of an RS's subtree for time interval $k$, $p^k_{loss}$, is then calculated using equation
(8). The exact mechanism for estimating the packet loss probability is

$$
\hat{p}^{k}_{loss} = (1 - \alpha) \cdot \hat{p}^{k-1}_{loss} + \alpha \cdot p^{k}_{loss}
$$

where $\alpha$ is a smoothing parameter and $\hat{p}^{k}_{loss}$ denotes the smoothed estimate. At the end
of each time interval, AL1 compares this estimate with two thresholds. If it is greater than
the upper threshold, then the corresponding RS will be active during the next time interval;
otherwise, if it is less than the lower threshold, the corresponding RS will be inactive
during time interval $k + 1$.

AL1 considers only the packet loss probability of an RS's subtree, $p^k_{loss}$, as the single
metric to control an RS's activation/deactivation decision. In our modified algorithm
(which we will refer to as AL2), during each time interval the packet loss rate of an RS's
subtree, as well as the RS's upstream RTT, is measured. The product of the upstream RTT
of an RS and $p^k_{loss}$ of its subtree is then used as the metric for RS activation/deactivation.
We denote the product of the upstream RTT and $p^k_{loss}$ as $\phi$:

$$
\phi = R_u \times p^k_{loss}
\tag{10}
$$

where $R_u$ denotes the round trip time from the current repair server to its nearest upstream
repair server (or the sender). Dynamic RS activation/deactivation is controlled by a
two-threshold mechanism based on the value of $\phi$. If $\phi$ is greater than the modified upper
limit, then the corresponding RS will be activated during the subsequent time interval; if
$\phi$ is smaller than the modified lower limit, then the corresponding RS will be deactivated.
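
The two-threshold rule can be sketched as a small state machine evaluated at the end of every measurement interval. This Python sketch is an illustration under assumptions: the class name `ThresholdController`, the default smoothing weight, and the threshold values in the usage note are hypothetical and are not taken from the simulations. Setting `use_rtt=False` gives an AL1-style rule driven by the smoothed loss estimate alone.

```python
class ThresholdController:
    """Two-threshold (hysteresis) activation rule, evaluated once per interval."""

    def __init__(self, lower, upper, alpha=0.25, use_rtt=True):
        if lower > upper:
            raise ValueError("lower threshold must not exceed upper threshold")
        self.lower = lower
        self.upper = upper
        self.alpha = alpha          # smoothing weight (assumed default)
        self.use_rtt = use_rtt      # True: AL2-style metric; False: AL1-style
        self.p_smooth = 0.0         # smoothed loss estimate
        self.active = False

    def end_of_interval(self, n_loss, n_pkt, upstream_rtt):
        # Measured loss probability for this interval, as in equation (8).
        p_k = n_loss / n_pkt if n_pkt else 0.0
        # Exponentially weighted smoothing of the measurement.
        self.p_smooth = (1 - self.alpha) * self.p_smooth + self.alpha * p_k
        # AL2-style metric: upstream RTT times loss estimate, as in equation (10).
        metric = self.p_smooth * upstream_rtt if self.use_rtt else self.p_smooth
        if metric > self.upper:
            self.active = True
        elif metric < self.lower:
            self.active = False
        # Between the two thresholds the previous state is kept.
        return self.active
```

For example, `ThresholdController(lower=1.0, upper=3.0, alpha=1.0)` fed a 10% loss rate with a 60 ms upstream RTT activates the RS (metric 6), keeps it active when the loss rate drops to 3% (metric 1.8, between the thresholds), and deactivates it at 1% (metric 0.6).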

4.2 Comparison of AL1 and AL2 via Simulation




We evaluate the new two-threshold policy on three topologies¹. Figure 5 illustrates
the first topology. In Figure 5, nodes 1, 2, 3, 4, 6, 11, 12, 16, and 17 are
routers attached to repair servers. S represents the sender (source). The remaining nodes
are receivers. Link propagation delays are marked in the graph in units of milliseconds.
Lossy links are represented by dashed lines. We assume that other links are lossless.

¹ The three topologies were originally generated by Diane Kiwior of The Analytic Sciences
Corporation, TASC.

In order to characterize the system-wide extent to which the repair servers are activated,
we introduce the active fraction, $\rho$, as follows:

$$
\rho = \frac{\text{Total active duration of all RSs}}{\text{Total running time} \times \text{Number of RSs}}
\tag{11}
$$

$\rho$ is taken as a measure of operational cost. The other measure of interest to us is
the average repair latency. We focus on the tradeoff between $\rho$ and the average repair
latency provided by algorithms AL1 and AL2. This is done through simulation. Here by
average repair latency we mean the average latency of recovering lost packets at all of
the receivers.
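
Equation (11) amounts to a one-line computation over simulation logs. In the Python sketch below, the function name and the representation of the log as one accumulated active duration per repair server are assumptions.

```python
def active_fraction(active_durations, total_running_time):
    """Share of total RS-time spent active; one accumulated duration per RS."""
    if total_running_time <= 0 or not active_durations:
        raise ValueError("need a positive running time and at least one RS")
    return sum(active_durations) / (total_running_time * len(active_durations))
```

For four repair servers that were active for 40, 0, 20, and 0 seconds of a 100-second run, the active fraction is 60 / 400 = 0.15.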

Note that by varying the thresholds, the fraction of repair servers that are active
will also vary. Figure 6 shows the relationship between the active fraction $\rho$ and the average
repair latency, the curves being generated by changing the value of the thresholds. In
our figures, AL1 represents the original algorithm, in which the loss probability $p^k_{loss}$ was
the only metric considered, while AL2 represents the modified algorithm that combines
the upstream round trip time and $p^k_{loss}$. We observe from Figure 6 that
AL2 can produce a lower average repair latency for the same value of $\rho$. That is, while
consuming the same amount of system resources as AL1, AL2 results in the successful
delivery of packets with lower average delay.

As a second comparison of interest, Figure 7 plots the worst average repair latency
(the largest average repair latency experienced over all receivers) for AL1 and AL2.
Consistent with the results from Figure 6, the modified algorithm also reduces the worst
average repair latency (for the same value of $\rho$).




Fig. 6. Average repair latency versus active fraction, topology 1

Another interesting result illustrated by Figures 6 and 7 is that the average repair
latency and worst average repair latency decrease rapidly as the fraction of active repair
servers increases from zero (i.e., no repair servers are ever active) to an active
fraction of approximately 10 percent. This indicates that much of the performance gain
from having active repair servers can be obtained by having only a relatively small fraction
of repair servers being active.




Fig. 7. Worst Average Repair Latency versus active fraction, topology 1 



We next introduce a new metric to compare the throughput of AL1 and AL2. The
system repair throughput overhead counts the total number of link traversals by all
retransmitted packets during the simulation. A smaller system repair throughput overhead
indicates that less link bandwidth is used in the network to recover from lost packets.
We observe, for topology 1, that AL2 results in a smaller repair throughput overhead
than AL1, as shown in Figure 8.

Recall that AL2 uses the upstream RTT in determining the activation/deactivation
status of an RS. That means that, among several subtrees suffering a similar loss rate, the RS
with a long-delay link to its parent will be activated first. For a given amount of repair
server resource usage, this could possibly result in higher repair traffic for the system as
a whole. Figure 9 shows a second topology, for which AL2 reduces (relative to AL1) both the
average repair latency over all receivers and the largest average repair latency, as shown
in Figures 10 and 11. However, for this topology, from Figure 12 we notice increased
retransmission throughput overhead incurred by using AL2 in comparison to AL1,
the opposite of what we observed in topology 1. This indicates that the difference in
retransmission throughput overhead between AL1 and AL2 is topology dependent. A third topology
and its corresponding simulation results can be found in [?].

Fig. 8. Repair Throughput Overhead versus active fraction, topology 1




5 Conclusion 



In this paper, we have studied the tradeoff between the time needed to successfully
deliver data to the receiver(s) and system resource consumption in server-based active
error recovery multicast networks. We began by developing stochastic models to study
the distribution of repair delay both in the presence, and in the absence, of repair servers.
Based on these models we observed that the application deadline, downstream link loss
rates, the number of receivers, and the upstream round trip time of a repair server were
all important criteria influencing the decision of whether to activate/deactivate an RS.

Fig. 10. Average repair latency versus active fraction, topology 2

Fig. 11. Worst average repair latency versus active fraction, topology 2

Recognizing the value of explicitly considering time-sensitive parameters in determining
when an RS should be active, and when it should not, we modified the algorithm in
[?] to consider not only the packet loss rate and number of downstream receivers, but
also the round trip time to the nearest upstream active repair server (or the sender) when
making the activate/deactivate decision. We studied the tradeoff that exists between the
fraction of RSs that are activated, the amount of overhead traffic, and the latency of
successful packet delivery. We found that our modified dynamic RS activation algorithm
provides a significant reduction in repair delay over the original algorithm [?], while
using no more system resources than the original algorithm. We also found that much
of the performance gains achievable by having active repair servers can be obtained by
having only a relatively small fraction of repair servers actually being active.

Fig. 12. Repair throughput overhead versus active fraction, topology 2

Our future research in this area will investigate performance under bursty link loss
rates. We expect the advantages here (over the case of statically configured repair servers
that are always active) to be even more significant, as repair servers can be adaptively
activated whenever burst loss occurs. An additional interesting area for study is to develop
theoretical models that characterize the performance gains possible with active
within-the-network servers, as a function of the density of such servers.



Abstract. Network measurements are crucial both to drive research and
in network operations. We introduce a taxonomy and survey the state of
the art of network measurement. We compare measurements available at
the network layer with those available at higher layers, specifically the
DNS and the Web. Both the DNS and the Web can be viewed as logical
networks; this allows the direct comparison of measurement methods
available at these layers with those available at the network layer. We
argue that measurement support within the DNS and the Web is insufficient
in light of the fact that they affect end-user performance as much
as the network layer. We derive some recommendations for the reuse of
network layer measurement methods in the DNS and the Web.



1 Introduction 

The Internet has evolved into a system of astonishing scale and complexity,
fraught with conflicting economic interests, comprised of subsystems from an
unprecedented range of vendors, and burdened by many short-sighted fixes to
fundamental problems (such as the deployment of NATs in response to a shortage
of IP addresses). The research community has no hope of completely modeling
the Internet [?]; only crude abstractions that focus on very specific subproblems
are within our reach. Therefore, we are called upon to discover the behavior
of the Internet, in addition to modeling aspects of it. The collection, analysis,
and interpretation of measurements are key to this aspect of Internet research.
It parallels other fields of study that are concerned with systems that escape
exhaustive modeling, such as econometrics and biometrics. For example,
measurements have revealed surprising statistical features in network traffic that
were not predicted by any models before their discovery [?]; clearly, exploratory
measurement studies deserve an important place in the research agenda.

Measurements are also crucial in the operation of the Internet. Traffic control
and engineering, i.e., routing and resource allocation to make the best use of
network resources while maximizing performance, directly depend on traffic
measurements [?]. Other examples include accounting and billing, intrusion and
attack detection, verification of service level agreements (SLAs),
measurement-based admission control [?], etc.

At higher layers, measuring traffic at popular Web sites as well as examining
problems like flash crowds [?, Chapter 11] (sudden surges in traffic aimed at a
site) require measurements where typically administrative access is not available.
Identifying the location of clients, mirroring popular sites and resources,
improving the performance of individual servers, and examining ways to move
popular services to the edge of the network require large-scale measurements.

In this paper, we give a brief survey of the state of the art of network
measurement. For this purpose, it is not useful to think of the Internet in terms of the
traditional layered model. Rather, we should think of the Internet as being
composed of multiple subsystems that exhibit network structure themselves. Chief
among these overlay networks are the domain name system (DNS) and the World
Wide Web (consisting of clients, origin servers, proxies, caches). Other examples
include content distribution networks (CDNs) and peer-to-peer networks, such
as Napster and Gnutella [?].

We argue that measurement efforts within these overlay networks face similar 
challenges and pitfalls as measurement efforts at the network layer, which are 
arguably more mature. To structure the comparison, we examine three classes 
of measurements for the network layer, for the DNS, and for the Web: topology, 
state, and traffic. The topology is the static, underlying structure of each network 
(e.g., physical links between routers, or the static configuration of a proxy in a 
browser). The state refers to dynamic changes in the active topology and other 
variables not directly related to traffic (for example, the utilization of a link would 
not be considered a state variable, while the operational state of a link is). The 
traffic refers to the flow of work through the network (e.g., packets through the 
IP network, name resolution requests and responses through the DNS network). 

Our systematic comparison illustrates the fact that logical networks, which
affect end-user performance, are severely under-instrumented today. The visibility
into the DNS and the Web is much more limited than at the network layer,
with the result that troubleshooting, control, and engineering are significantly more
challenging for these networks. The purpose of this paper is to point out some of
these shortcomings, and to suggest possible remedies inspired by measurement
support at the network layer.

The paper is structured as follows. Section 2 briefly summarizes the state of
the art in measurement at the network layer. Later sections give
an analogous assessment for the DNS and the Web. We then summarize our findings
and proposed remedies.

2 IP Network Measurements 

In this section, we give an overview of measurements available at the IP network 
layer. For each type of measurement - topology, state, and traffic - we describe 
what measurements are available externally (i.e., without having administrative 
control over the network) and internally (with administrative access). 
2.1 Topology 

The topology of the Internet can be thought of as having two levels of hierarchy: 
the autonomous system (AS) level and the network domain level. We mainly 
focus on the domain level. 

External measurements. Several classes of methods have been proposed to in- 
fer the topology of the Internet from external measurements. The first class of 
methods, commonly referred to as topology discovery, relies on a combination of 
probing methods such as ping and traceroute, and of heuristics to sample the IP 
address space in an intelligent way to find new nodes and links. 

The second class of methods relies on correlation in the packet loss process of 
a multicast session. Specifically, a packet loss in a multicast session is experienced 
by all the receivers downstream from the link where the loss occurred. Thus, 
by observing a large number of packets at these receivers, the structure of the 
multicast tree can be approximately inferred. This corresponds to finding 
a subgraph of the network topology. 

The third class of methods focuses on inferring other static attributes of 
the network, such as the capacity of links through active probing [2], or the 
scheduling discipline. On the Web, identifying and characterizing 
intermediaries (such as HTTP/1.0 and HTTP/1.1 proxies) is attempted via probing 
techniques. 

Internal measurements. There are several additional sources of information 
about network topology and configuration when one has administrative con- 
trol over a network domain. Chief among these are the router configuration files, 
which provide a router’s local view of the topology, including its neighbors, links 
to and from these neighbors and their capacities. From this, it is conceptually 
easy to completely determine the physical network topology of the domain. 
Complications can arise because direct manipulation of router configuration files 
by operations personnel can lead to inconsistent configurations. 

Inferring the topology at the AS level (a graph with currently approximately 
10000 nodes) is impossible today. While it is easy to obtain a list of all the ASs, 
their connectivity depends on local public and private peering arrangements 
and the routing policies put in place by ISPs. Some heuristic methods to infer 
subgraphs of the full AS topology have been described in the literature. 

2.2 State 

Next, we compare inferring the state of the network, assuming that the underly- 
ing topology is known. Network state includes the operational state of links and 
routers, the routing and forwarding tables in effect, and other variables that do 
not directly depend on traffic (e.g., temperature of the CPU). 

External measurements. Essentially the same tools used for external topology 
discovery can be relied upon to discover the operational state of links and routers 
in a domain. The obvious drawback, as in any polling scheme, is that there 
is a potential delay between a state transition and the time it is discovered; 
this delay depends on the poll cycle. Also, the absence of a response to a ping 
packet can imply that the target router or interface is down, or that the path to 
that router/interface is down. Therefore, the results of multiple pings must be 
carefully combined to infer link and router state correctly. 
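
The combination step can be sketched as follows. This is only a minimal illustration, assuming that the path from the monitoring host to each target is already known (for example from earlier traceroute runs); the function name `suspect_nodes` and the data layout are hypothetical and not part of any existing tool.

```python
def suspect_nodes(paths, reachable):
    """Return the targets whose own failure best explains the missing replies.

    paths:     dict mapping target -> list of upstream hops (nearest first),
               assumed known from earlier topology discovery.
    reachable: dict mapping node -> True if its latest ping was answered.
    """
    suspects = set()
    for target, upstream in paths.items():
        if reachable.get(target, False):
            continue  # the target answered, nothing to explain
        # Blame the target only if every hop on the way to it still answers;
        # otherwise the silence may be caused by a failure closer to us.
        if all(reachable.get(hop, False) for hop in upstream):
            suspects.add(target)
    return suspects


# Hypothetical chain: monitor -> r1 -> r2 -> r3
paths = {"r1": [], "r2": ["r1"], "r3": ["r1", "r2"]}
reachable = {"r1": True, "r2": False, "r3": False}
print(suspect_nodes(paths, reachable))
```

In this example only r2 is reported: r3 is also silent, but its silence is already explained by the failure of r2 (or of the link leading to it), so nothing can be concluded about r3 itself.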

To obtain a snapshot of the state of routing, traceroute has been successfully 
used. This basically amounts to temporally sampling a small subset of 
routing table entries. Pathchar is an extension of traceroute that is able to obtain 
rough estimates of additional path characteristics (loss and delay). Beyond this, 
it is virtually impossible to measure other state variables from the outside. 

Internal measurements. Observing the state of network elements is the realm of 
network management protocols such as SNMP. SNMP enables a network 
management station to query remote state variables through an agent. The 
remote variables are standardized as a MIB (management information base) tree. 
In addition to this polling mode, SNMP allows the definition of events that alert 
the management system asynchronously to state changes. In practice, SNMP 
tends to incur a relatively high overhead in routers, and its usefulness for 
fine-grained tracking of network state is limited. 

Another method consists of intercepting link state advertisement messages 
exchanged by the intra-domain routing protocol (e.g., OSPF). This approach 
has the advantage of being authoritative, in the sense that the observed state is 
exactly the one that the computation of routing tables is based upon. 

2.3 Traffic 

We next examine methods to measure the domain-wide traffic flow. This includes 
the load (or demand) imposed on the network domain, the routes followed 
by the incoming traffic, and its loss and delay characteristics. 

External measurements. It is virtually impossible to estimate the traffic as a 
whole through a large domain by external probing of that domain. 
The only methods that fall into this category are recent proposals for the 
inference of sink trees in distributed denial-of-service (DDoS) attacks, such as IP 
traceback [32]. In this method, routers randomly encode their address into the 
identification field of the IP header. With enough samples, a target site of a 
DDoS attack can reconstruct the sink tree of attack traffic through these 
encoded addresses. 

Internal measurements. There are several methods proposed in the literature 
that infer domain-wide traffic statistics from different types of measurement. 

The first class of methods, called network tomography, relies only on 
measurements of link utilizations over time. The goal of these methods is to infer the 
traffic matrix, i.e., the traffic intensity between every ingress and every egress 
point. 
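
In matrix form, the inference problem can be stated as follows. The notation is introduced here only for illustration: it assumes that routing is fixed and known, that there are $L$ links and $P$ ingress-egress pairs, and that measurement noise is ignored.

```latex
y = A x, \qquad y \in \mathbb{R}^{L}, \quad x \in \mathbb{R}^{P}, \qquad
A_{lp} =
\begin{cases}
1 & \text{if the route of ingress-egress pair } p \text{ traverses link } l, \\
0 & \text{otherwise.}
\end{cases}
```

Here $y$ collects the measured link loads and $x$ the unknown entries of the traffic matrix. Since the number of pairs $P$ typically far exceeds the number of links $L$, the system is underdetermined, which is why such methods need additional assumptions about the traffic.
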

The second class of methods relies on aggregate flow measurements at network 
ingress and/or egress points. A flow is an artificial abstraction of a set of IP 
packets with identical source-destination addresses (or address prefixes) that are 
observed close together in time. This method has some drawbacks in terms of 
overhead, implementation cost, and delay; nevertheless, careful post-processing 
can yield satisfactory estimates of the domain-wide traffic flow. 
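
The flow abstraction can be made concrete with a short sketch. It assumes that packets are available as (timestamp, source, destination, size) records and that "close together in time" means an inactivity timeout; the 60-second default, the record layout, and the function name `aggregate_flows` are illustrative assumptions, not the definition used by any particular router.

```python
def aggregate_flows(packets, timeout=60.0):
    """Group (timestamp, src, dst, size) packet records into flows.

    A packet joins the open flow of its (src, dst) pair if it arrives within
    `timeout` seconds of that flow's last packet; otherwise it starts a new
    flow. The timeout value is an assumption of this sketch.
    """
    active = {}   # (src, dst) -> index of the most recent flow in `flows`
    flows = []
    for ts, src, dst, size in sorted(packets):
        key = (src, dst)
        idx = active.get(key)
        if idx is not None and ts - flows[idx]["end"] <= timeout:
            flow = flows[idx]
            flow["end"] = ts
            flow["packets"] += 1
            flow["bytes"] += size
        else:
            active[key] = len(flows)
            flows.append({"src": src, "dst": dst, "start": ts, "end": ts,
                          "packets": 1, "bytes": size})
    return flows


# Made-up packet records (addresses are placeholders).
packets = [
    (0.0, "10.0.0.1", "10.0.1.1", 1500),
    (1.0, "10.0.0.1", "10.0.1.1", 1500),
    (2.0, "10.0.0.2", "10.0.1.1", 40),
    (200.0, "10.0.0.1", "10.0.1.1", 500),
]
flows = aggregate_flows(packets)
for flow in flows:
    print(flow)
```

The last packet starts a new flow even though its address pair was seen before, because it arrives long after the timeout has elapsed.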

The third class of methods uses packet sampling. Packet sampling can 
be used at all ingress points (in analogy to flow aggregation), relying on 
measured or simulated routing tables to infer the flow through the domain. Alternatively, 
a method called trajectory sampling relies on pseudo-random sampling based on 
hash functions computed over packet content to directly observe the paths that 
packets follow through the domain. 
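
The idea of hash-based sampling can be sketched as follows. This is a minimal illustration, not the scheme of the cited work: the use of SHA-256, the 64-bit threshold, and the name `is_sampled` are assumptions, as is the choice of which packet bytes are treated as invariant along the path.

```python
import hashlib


def is_sampled(invariant_bytes, rate=0.01):
    """Decide from packet content alone whether a packet is sampled.

    invariant_bytes: the parts of the packet that do not change hop by hop
                     (which fields qualify is an assumption of this sketch).
    rate:            target sampling fraction, between 0 and 1.
    """
    digest = hashlib.sha256(invariant_bytes).digest()
    value = int.from_bytes(digest[:8], "big")   # 64-bit hash value
    return value < rate * 2**64


# Two observation points applying the same function to the same packets.
packets = [f"packet-{i}".encode() for i in range(10000)]
router_a = {p for p in packets if is_sampled(p, rate=0.1)}
router_b = {p for p in packets if is_sampled(p, rate=0.1)}
print(router_a == router_b, len(router_a))
```

Because the decision depends only on packet content, every observation point selects the same subset of packets, so the samples collected at different points can be joined to recover the path of each sampled packet.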

3 DNS and Web Measurements 

The various problems we have described at the network layer are largely reflected 
in other layers as well. In general, obtaining administrative access to all aspects 
of applications that cross the network layer is harder. As examples, we examine 
two application areas: the Domain Name System (DNS) and the Web. 

The most popular application on the Internet currently is the World Wide 
Web, which is responsible for 75% of the packets on the Internet. Traffic 
between the millions of Web users and Web sites has grown steadily for a 
decade. Presently there is significant growth at the intranet level as well. 
Traffic in peer-to-peer networks is growing due to the popularity of Napster 
and Gnutella, but it is a much smaller part of the overall traffic and remains 
largely concentrated on college campuses. 

Increases in user-perceived latency and redundant transfers of popular content 
across the network led to the deployment of caches between the clients and the 
origin servers where resources reside or are generated. Once the usefulness of 
caching began to crest, offloading of content delivery became popular and led to 
the advent of content distribution networks (CDNs). Most CDNs use DNS-based 
redirection, which has caused a significant increase in DNS traffic on the Internet. 

We begin with some background information and then examine why inferring 
topology, state, and traffic is difficult in each of these areas. 



3.1 DNS Measurements 

Topology. The topology of the DNS network consists of a collection of 
top-level domains (such as .com, .edu, .it, etc.) that are just below the root of 
a hierarchy. These are then organized into separately administered zones (e.g., 
att.com). The individual zones are responsible only for registering the names 
and IP addresses of a set of authoritative DNS servers with the root servers. 
Client requests for translation of names to IP addresses and vice-versa are typi- 
cally sent by a resolver library that contacts a local DNS server. The local DNS 
server will check its cache for the request and if it does not have any pertinent 
information it will forward the request to a root server. The root server will 
return the names and addresses of the authoritative DNS servers that can help 
answer the query. The queries may proceed iteratively with each query resulting 
in a pointer to the next server to be queried or recursively, whereby the queried 
server will do the necessary work and return the result. Positive caching (for 
hits) and negative caching (for failures) with a specific time to live value is rou- 
tinely employed at the DNS servers. Most client sites have more than one local 
DNS server — one or two more serve as secondary servers for backup purposes. 
Figure 1 shows the various steps involved in resolving the address of the server 
component embedded in a URL. 




[Figure labels: root server; top-level domain server; second-level domain server; local area network] 

Fig. 1. DNS resolver and local DNS server 
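
The resolution steps described above can be mimicked with a toy simulation. It is only a sketch under strong simplifications: the server table `SERVERS`, all names and addresses, the single fixed time to live, and the function `resolve` are invented for illustration and do not correspond to real DNS software or real zones.

```python
import time

# Hypothetical hierarchy: each server either refers the resolver to the next
# server or returns the final address (documentation address range).
SERVERS = {
    "root": {"referral": {"com": "tld-com"}},
    "tld-com": {"referral": {"example.com": "ns.example.com"}},
    "ns.example.com": {"answer": {"www.example.com": "192.0.2.10"}},
}


def resolve(name, cache, now=None, ttl=300):
    """Iteratively resolve `name`; return (address, servers contacted)."""
    now = time.time() if now is None else now
    hit = cache.get(name)
    if hit is not None and hit[1] > now:
        return hit[0], []                    # served from the local cache
    server, contacted = "root", []
    while True:
        contacted.append(server)
        record = SERVERS[server]
        if name in record.get("answer", {}):
            address = record["answer"][name]
            cache[name] = (address, now + ttl)      # positive caching
            return address, contacted
        # Follow the referral whose zone is a suffix of the queried name.
        nxt = next((srv for zone, srv in record.get("referral", {}).items()
                    if name == zone or name.endswith("." + zone)), None)
        if nxt is None:
            cache[name] = (None, now + ttl)         # negative caching
            return None, contacted
        server = nxt


cache = {}
print(resolve("www.example.com", cache, now=0))    # walks the hierarchy
print(resolve("www.example.com", cache, now=10))   # answered from the cache
print(resolve("www.example.org", cache, now=10))   # failure, cached as well
```

Note that the client issuing the request sees only the final answer; the list of servers contacted is known to the local DNS server alone, which illustrates why so little is visible beyond the first hop.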



A host relying on DNS may be able to examine the static configuration file 
(e.g., /etc/resolv.conf on many UNIX systems), but more recent systems rely 
on DHCP (Dynamic Host Configuration Protocol) for more automated 
configuration. The extent of information available to a client is often limited to the 
names of local DNS servers. The set of authoritative DNS servers that may have 
to be contacted by the local DNS servers to resolve all the various addresses of 
interest is simply too large, and it changes from time 
to time. The local DNS servers have to be kept operational at all times since 
in their absence applications will not be able to conduct any remote network 
activity. 

Even when the set of DNS servers is within a single administrative domain, 
there is no simple mechanism to obtain a list of the configured collection since 
they are distributed across a large number of machines. An attempt to walk 
the DNS tree hierarchy by using tools like dig, starting at the zone from a 
server and examining the name server (NS) records, would still miss servers that 
are not authoritative (caching-only servers) and the unofficial secondary servers 
used for fault tolerance purposes. Additionally, access control lists placed on 
zones will make this process even harder. The model of interaction with remote 
DNS servers is hop by hop with little or no knowledge available about anything 
beyond the first hop. There is no equivalent of traceroute at the DNS layer. 
State. Changes occur often at the DNS layer: new domains are registered, old 
ones change, cache information becomes stale, etc. The only way to learn about 
changes is upon a request and failure is often based on a timeout model. Consider 
the problem of a common misconfiguration known as lame delegation: a set of IP 
addresses has been registered as authoritative for a particular domain. Suppose 
one of them is incorrect: it is either not an authoritative server for that domain 
or does not even run a name server. Queries sent to this server will time out and 
be resent leading to additional unnecessary traffic. Note that a lame server may 
be contacted directly or as a result of redirection from another authoritative 
server for the zone. 

It has been estimated that up to 25% of all DNS zones may have lame 
delegations. There are two problems in identifying and removing lame 
delegations. They are rarely discovered, since often the only indication 
of their presence is additional latency, and a redundant server will eventually 
answer the query. Even if they are identified, fixing the problem requires 
interactions with administrators (who are hard to locate). Finally, there is no 
guarantee that the problem will indeed be fixed. Although the rate of change 
of information at the DNS layer is significantly less than at the network layer, 
issues of scale and distributed control make it harder to determine and correct 
state problems. 
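
A check for lame delegations could be organized along the following lines. The sketch deliberately leaves out the actual DNS query: the caller supplies a `probe` function that reports how a server responds for a zone, and the zones, addresses, and outcomes below are made up for illustration.

```python
def find_lame_delegations(delegations, probe):
    """Return (zone, server) pairs whose registered server is lame.

    delegations: dict mapping zone -> list of server addresses registered
                 as authoritative for it.
    probe:       callable (server, zone) returning "authoritative",
                 "not-authoritative", or "timeout". In practice this would
                 send a query and inspect the reply; here it is supplied by
                 the caller.
    """
    lame = []
    for zone, servers in delegations.items():
        for server in servers:
            if probe(server, zone) != "authoritative":
                lame.append((zone, server))
    return lame


# Hypothetical registrations and simulated probe outcomes.
delegations = {"example.com": ["192.0.2.1", "192.0.2.2", "192.0.2.3"]}
outcomes = {
    ("192.0.2.1", "example.com"): "authoritative",
    ("192.0.2.2", "example.com"): "not-authoritative",  # wrong server listed
    ("192.0.2.3", "example.com"): "timeout",            # no name server there
}
print(find_lame_delegations(delegations, lambda s, z: outcomes[(s, z)]))
```

Such a check has to probe every registered server explicitly, precisely because ordinary resolution hides the problem as soon as one of the other servers answers.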



Traffic. There are no known mechanisms to trace the DNS traffic as it percolates 
through the hierarchy. Although logs are maintained on several servers (primar- 
ily to detect scan attempts by hackers) they disclose at best partial information. 
The handing off of control to a backup server via DNS’s zone transfer mecha- 
nism (a common occurrence) is not known to other servers since the mechanism 
is meant to be transparent to the end users. The recent significant increase in the 
DNS traffic due to overlay networks like Akamai (whereby URLs of content dis- 
tributed resources are replaced with alternate CDN company specific ones whose 
address resolution can be controlled for load balancing purposes) became visible 
only because the rewritten URLs can be seen in the HTML text. Other CDNs 
do dynamic URL rewriting making such detection even harder. DNS servers of 
CDN companies often give out short TTLs to have more fine-grained control 
over the use of the mirror servers. With each short TTL expiration the CDN can 
balance the load on its network of servers. However, this results in additional 
DNS traffic with questionable performance improvements. Furthermore, obtain- 
ing a complete list of sites to which requests get redirected is hard since the CDN 
resolution is designed to give different answers at different times depending on 
the client’s location. 
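
The effect of short TTLs on DNS traffic follows from simple arithmetic. The sketch below assumes that a name is requested continuously, so that a caching server refetches the record as soon as it expires; the TTL values are illustrative and not measurements of any particular CDN.

```python
def refreshes_per_day(ttl_seconds):
    """Approximate upper bound on the lookups one caching server sends per
    day for a single name, assuming a refetch at every TTL expiration."""
    return 86400 / ttl_seconds


for ttl in (20, 300, 86400):
    print(ttl, refreshes_per_day(ttl))
```

Under this assumption, lowering the TTL from one day to 20 seconds multiplies the lookups for that name by a factor of 4320 at each caching server.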

Inside the network, one can obtain flow records via tools like NetFlow or 
by running packet tracing programs like tcpdump. In both cases, the task of 
extracting DNS traffic for the purpose of identifying problems is quite complex, 
although there is currently some work in progress in this direction [24]. The 
problems at the internal level include obtaining a complete view of all DNS 
traffic entering and exiting the network. Even if the complete information is 
obtained at high cost, the local configurations are not known which results in at 
best a partial view of what is actually occurring. 

3.2 Web Measurements 

As mentioned earlier, Web traffic accounts for 75% of all packets on the Internet. 
A Web transfer requires interaction with a variety of protocols and often with 
multiple entities on the Internet. The suite of protocols includes the Domain Name 
System (DNS) protocol, the transport layer protocol (often TCP) for 
transporting the requests and responses reliably between the Web client and the server, 
and HTTP (HyperText Transfer Protocol), the protocol that underlies the 
World Wide Web and serves as the language of Web messages. 

A single user click can result in the involvement of several entities 
[23, Chapter 15], including the following: 

— A browser (and often a client side cache). 

— Several Web intermediaries such as Web proxies or gateways. The proxy may 
be directly configured, be invisible to users (interception proxy), or be part 
of a large proxy farm. 

— A surrogate server in front of the actual Web server deployed to balance the 
load at the Web site. 

— The DNS server at the client/proxy side and additional redirected lookups 
due to rewritten URLs (due to content distribution overlay networks such as 
Akamai or Digital Island) and advertisement servers (who contribute images 
and text to the full container document). 

The number of parties involved on an end to end basis in a single Web 
transaction can be more than a handful. 



Topology. The resources requested on the Web by clients like browsers (or 
quite often programs like spiders), can often be served from different locations 
either locally through surrogates on the server side or remotely at mirror sites. 
The set of entities on the Web includes clients, proxies, gateways, servers, 
surrogates, and mirrors. The choice of mirrors is dynamically decided. A user’s request 
may traverse several of these entities. Often the user has only control over the 
next-hop proxy and thus identifying all the entities is hard. The recent advent 
of the HTTP/1.1 version of the protocol allows intermediaries to 
identify themselves through a new HTTP header (Via). But given the widespread 
prevalence of HTTP/1.0 proxies in the Internet for the foreseeable future, those 
entities may not participate in this enhancement. Furthermore, the presence of 
interception proxies (the ones that dip into the network layer to examine sus- 
pected HTTP traffic and possibly redirect them) exacerbates the difficulty of 
knowing all the entities that are involved in a transaction. 
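
As an illustration of how much (and how little) the Via header reveals, the following sketch splits a Via value into its hops. It is a simplified parser written for this example, not a complete implementation of the header grammar: it assumes that comments in parentheses contain no commas, and the host names in the sample value are placeholders.

```python
def parse_via(value):
    """Split a Via header value into (protocol, received_by) pairs.

    Simplified: assumes comments (text in parentheses) contain no commas.
    """
    hops = []
    for element in value.split(","):
        parts = element.split("(")[0].split()   # drop an optional comment
        if len(parts) >= 2:
            hops.append((parts[0], parts[1]))
    return hops


print(parse_via("1.0 fred, 1.1 proxy.example.com:8080 (ExampleProxy/2.0)"))
```

The resulting list is only a lower bound on the intermediaries actually traversed, since proxies that do not add the header, and interception proxies in particular, leave no trace in it.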



State. A resource may be cached at a proxy or at a dynamic mirror site. For load 
balancing and fault tolerance purposes a set of resources may be available from 
multiple sites in case some sites are inaccessible at any given time. Learning 
about the state of a specific server is hard due to possible redirection at the 
HTTP or DNS level. 

The widespread presence of caching proxies makes it much harder to deter- 
mine the actual number of requests generated for a particular resource. Down- 
loading a container document (such as an HTML file which includes links to one 
or more embedded resources, such as images or animations) requires contacting 
several servers due to the growing use of CDNs and the presence of advertise- 
ments (often located on remote machines). Since the CDNs often dynamically 
decide the mapping between strings and the actual machines that serve the dis- 
tributed content, it is not possible to obtain stable latency metrics. The end to 
end measurements of interest include user perceived latency, load on the network, 
and load on the servers. However, inferences regarding load on a remote server 
are very hard to obtain. Simple hacks like examining TCP sequence numbers can 
be risky in the presence of redirections at the HTTP and DNS layer. Caches 
introduce the well known problem of staleness: a significant fraction of HTTP 
requests are validation queries to ensure that a cached resource is the same as 
the current instance on the origin server. There are risks to a cache assigning 
freshness time overriding the origin server’s wishes. 



Traffic. The end to end traffic on the Web is simply too large and too 
complex to estimate. Companies like Media Metrix and Keynote attempt to 
present sampled figures by examining traffic at a few interchanges and extrap- 
olating from them. As discussed above, such studies miss the cached responses, 
failures, etc. Some studies have been carried out to perform end to end 
measurements that examine improvements due to the new version of the protocol, 
such as the reduction in the number of TCP connections due to the persistent 
connections feature, cache effectiveness, content delivery from multiple sites (CDNs, 
ad servers etc.), and latency reduction due to the ability of downloading partial 
responses. Yet, there is not a statistically reliable sampling technique for esti- 
mating end to end traffic, due to the complexity of the Web, the widespread 
prevalence of intermediaries, and implementations that are not compliant with 
the protocol specification, etc. 

Even at the intranet level, a Web server may not know what fraction of 
requests directed towards it eventually reach it, due to the possibility of proxy 
cache farms in the path. The server would have to know the configuration 
information of all clients in order to obtain a good estimate, or indulge in cache 
busting [22]. 

Currently the best known traffic artifacts are logs maintained at the proxy 
and server level. The logs record several fields including the IP address of incom- 
ing request, time of request, the HTTP method, URL, protocol version, response 
code and content length. Even this relatively small subset of items logged has 
problems associated with how they are interpreted. The ‘client’ IP addresses 
recorded could be the last hop proxy and not the original client. A long response 
in transit may never reach the client who may have aborted the request; yet the 
server log might indicate that several thousand bytes of response were sent. 
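
The fields listed above can be extracted from a log entry as sketched below. The sketch assumes that entries follow the widely used Common Log Format, which the text does not name explicitly; the sample line is made up, and the pattern `CLF` and function `parse_log_line` are illustrative.

```python
import re

# host ident authuser [time] "method url version" status length
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<version>[^"]+)" '
    r'(?P<status>\d{3}) (?P<length>\d+|-)'
)


def parse_log_line(line):
    """Return the logged fields as a dict, or None if the line does not match."""
    match = CLF.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A "-" means that no length was logged; it is mapped to 0 here.
    entry["length"] = 0 if entry["length"] == "-" else int(entry["length"])
    return entry


line = ('192.0.2.7 - - [10/Oct/2000:13:55:36 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326')
print(parse_log_line(line))
```

The caveats discussed above apply to the parsed values: the `host` field may be the address of the last-hop proxy rather than that of the client, and `length` records what the server sent, not what the client received.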

4 Conclusion 

We have argued that network measurements are crucial both to drive funda- 
mental discoveries and for the purpose of control and engineering. At the IP 
network layer, this need is fairly obvious, and instrumentation support at the 
network layer is reasonably mature as a result of pressure on vendors to include 
measurement support in their products. However, end-user performance depends 
as much on the performance of logical networks such as the DNS and the Web 
as on the network layer proper. This suggests that the granularity and scope 
of measurements available at these layers should match that at the network 
layer. We have illustrated in this paper that this is not the case through a direct 
comparison of the state of the art at the network layer with the DNS and the 
Web. 

We have described network measurements for topology, state, and traffic. 
There is a range of methods for each category. Administrative access to a network 
domain is usually required to gain access to measurements of sufficient quality 
to perform traffic engineering. Nevertheless, a range of clever methods are also 
available to obtain useful snapshots of network topology, routing state, and traffic 
loads. 

At the higher layers the problem of inference is more complicated since topol- 
ogy, state, and traffic have several additional entities and hidden artifacts (some 
of which are known, such as lame delegations at the DNS layer). Additionally, 
interesting questions such as user-perceived latency or end-to-end delay are in- 
herently more complicated due to the involvement of multiple protocols and 
intermediaries. The problem is further compounded by implementations of Web 
components that are not fully compliant with the protocol specification. 
In some cases application level protocols have attempted to mimic some of the 
useful ideas in the network layer. For example, HTTP/1.1 introduced the Via 
and Max-Forwards headers to expose some of the topology information about 
intermediaries and to better target the requests. 

We believe that additional measurement support at the application layer 
could be inspired by tools and methods that have proved valuable at the net- 
work layer. For example, an equivalent of traceroute in DNS to track a request, 
pathchar in HTTP to derive performance properties of nodes (proxies) on the 
path to the server, or native support for request sampling, would be useful for 
troubleshooting, testing, and control. While the details of such a suite of higher- 
layer tools would certainly reflect the application area, we hope that the foregoing 
discussion has shown conceptual similarity between the network layer and other 
application areas to motivate the reuse of the expertise gained at the network 
layer. 
Abstract. In this paper we describe an analytical approach to estimate the 
performance of greedy and short-lived TCP connections, assuming that only the primitive 
network parameters are known, and deriving from them round trip time, loss prob- 
ability and throughput of TCP connections, as well as average completion times 
in the case of short-lived TCP flows. It exploits the queuing network paradigm 
to develop one or more ‘TCP sub-models’ and a ‘network sub-model,’ that are 
iteratively solved until convergence. Our modeling approach allows taking into 
consideration different TCP versions and multi-bottleneck networks, producing 
solutions at small computational cost. Numerical results for some simple single 
and multi-bottleneck network topologies are used to prove the accuracy of the 
analytical performance predictions, and to discuss the common practice of apply- 
ing to short-lived TCP flows the performance predictions computed in the case of 
greedy TCP connections. 



1 Introduction 

Dimensioning, design and planning of IP networks are important problems, for which 
satisfactory solutions are not yet available. Many network operators and Internet Service 
Providers (ISPs) are dimensioning their networks by trial and error, and the more so- 
phisticated dimensioning approaches are often still largely based on the packet network 
design algorithms devised in the ’70s. These rely on Poisson assumptions for the 
user-generated packet flows, thus completely neglecting the recent findings on traffic 
self-similarity, as well as the closed-loop congestion control algorithms of TCP, that 
carries about 90% of the Internet packets. A careful solution to the IP network dimen- 
sioning problem must be based on performance models capable of accurately estimating 
the efficiency of the applications run by end users. This translates into the need for models 
capable of accurately forecasting the throughput of TCP connections and the corre- 
sponding delay in the transfer of files. A large number of approaches have been recently 
proposed to estimate the performance of TCP connections interacting over a common 
underlying IP network (a brief overview of some of the recent proposals is presented in 
Section 2). They can be grouped in two classes: 
Fig. 1. Interaction between the TCP and the network sub-models. 



1. models that assume that the round trip time and the loss characteristics of the IP 
network are known, and try to derive from them the throughput (and possibly the 
delay) of TCP connections; 

2. models that assume that only the primitive network parameters (topology, number of 
users, data rates, propagation delays, buffer sizes, etc.) are known, and try to derive 
from them the throughput and possibly the delay of TCP connections, as well as the 
round trip time and the loss characteristics of the IP network. 

Often, models in the second class incorporate a sub-model similar to the models of the 
first class to account for the dynamics of TCP connections, together with a sub-model 
that describes the characteristics of the IP network that carries the TCP segments. The 
two sub-models are jointly solved through an iterative fixed point algorithm (FPA). This 
is the approach that we adopt also in this paper, using the queuing network modeling 
paradigm to develop both the ‘TCP sub-model’ and the ‘network sub-model’. Actually, 
with our approach, the TCP sub-model itself can be made of several components, each 
describing a group of ‘homogeneous’ TCP connections, where by this term we mean 
TCP connections sharing common characteristics (same TCP version, comparable round 
trip time, similar loss probability). Each group is modeled with a network of M/G/∞ 
queues that describes in detail the dynamics of the considered TCP version, as explained 
later in Section 3. Instead, the network sub-model is made of a network of /M/1/B_i 
queues, where each queue represents an output interface of an IP router, with its buffer 
of capacity B_i packets. The routing of customers on this queuing network reflects the 
actual routing of packets within the IP network, as explained later. 

A graphical representation of the modeling approach is shown in Fig. 1. The TCP 
sub-model comprises several components referring to homogeneous groups of TCP 
connections. Each component receives as inputs (from the network sub-model) the ap- 
propriate estimate of the packet loss probability along the TCP connection routes, as well 
as the estimates of the queuing delays at routers. Each component produces estimates of 
the load generated by the TCP connections in the group. The network sub-model receives 
as inputs the estimates of the load generated by the different groups of TCP connections, 
and computes the loads on the network channels, and the packet loss probabilities and 
average queuing delays. These are fed back to the TCP sub-models in an iterative 
procedure that is stopped only after convergence is reached. Thus, our modeling approach 
allows taking into consideration different TCP versions and multi-bottleneck networks. 
In addition, our modeling approach allows considering both long- and short-lived TCP 
connections, as explained later, and obtains solutions at very little cost, even in 
the case of thousands of TCP connections that interact over the underlying IP network. The 
input parameters of the whole modeling approach are just the primitive characteristics 
of the protocol and the physical network (channel lengths and data rates, buffer sizes, 
packet lengths, file sizes, TCP versions, TCP connection establishment rates, maximum 
window sizes, etc.). All other parameters are derived from the protocol and network 
characteristics. The convergence of the FPA used in the solution of the overall composed 
model was studied and proved in several cases. 
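
The structure of the iterative procedure can be illustrated with a deliberately simplified sketch. It is not the model of this paper: the TCP sub-model is replaced by the well-known square-root throughput approximation, the network sub-model by a single M/M/1/B queue, and a damping factor is added to help the iteration settle. All function names and parameter values (10 connections, 1000 packets/s, 20 buffer places, 100 ms base round trip time) are assumptions chosen for the example.

```python
import math


def tcp_load(loss, rtt, n_conn, max_window=64):
    """Stand-in TCP sub-model: load (packets/s) offered by n_conn connections,
    using the square-root approximation capped by the maximum window."""
    window = max_window if loss <= 0 else min(max_window, math.sqrt(1.5 / loss))
    return n_conn * window / rtt


def network(load, capacity, buffer_size):
    """Stand-in network sub-model: loss probability and mean delay of an
    M/M/1/B queue with `buffer_size` places in total."""
    rho = load / capacity
    weights = [rho ** k for k in range(buffer_size + 1)]
    total = sum(weights)
    probs = [w / total for w in weights]
    loss = probs[-1]                          # arrival finds the system full
    mean_in_system = sum(k * p for k, p in enumerate(probs))
    delay = mean_in_system / (load * (1 - loss))   # Little's law
    return loss, delay


def solve(n_conn=10, capacity=1000.0, buffer_size=20, base_rtt=0.1,
          damping=0.1, tol=1e-9, max_iter=10000):
    """Alternate between the two sub-models until the estimates stop moving."""
    loss, delay = 0.01, 0.0                   # arbitrary starting guess
    for iteration in range(1, max_iter + 1):
        load = tcp_load(loss, base_rtt + delay, n_conn)
        new_loss, new_delay = network(load, capacity, buffer_size)
        # Damped update: move only part of the way towards the new estimates.
        next_loss = loss + damping * (new_loss - loss)
        next_delay = delay + damping * (new_delay - delay)
        if abs(next_loss - loss) < tol and abs(next_delay - delay) < tol:
            return next_loss, next_delay, load, iteration
        loss, delay = next_loss, next_delay
    raise RuntimeError("fixed point iteration did not converge")


loss, delay, load, iterations = solve()
print(f"loss={loss:.4f} delay={delay * 1000:.2f} ms "
      f"load={load:.0f} pkt/s after {iterations} iterations")
```

The only inputs are primitive parameters; the loss probability, the queuing delay, and the offered load all emerge from the iteration, which is the essential feature of the second class of models.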



2 Related Work on TCP Models 



The analysis of the behavior of TCP presented in [?] and [?] is based on measurements. 
In the first paper the modeling technique is empirical, while in the second the model is 
based on the analysis of TCP transmission cycles. The packet loss ratio and connection 
round trip time are needed as inputs to both models in order to allow the derivation of 
the TCP throughput and average window size. While the first paper assumes that losses 
are not correlated, the second paper also takes into account the correlation in packet 
losses. An extension to this latter paper that allows computing the latency of short file 
transfers is presented in [?]. Modeling techniques based on differential equations are used 
in [?] and [?]. A fluid flow model is used in [?] to compare the performance of 
TCP-NewReno and TCP-Vegas. The authors of [?] use stochastic differential equations to 
model the combined macroscopic behavior of TCP and the network itself. The resulting 
model can cope with multiple interacting connections, defining a closed-loop system that 
can be studied with powerful control-theoretic approaches. Markovian modeling is used 
in [?], which presents Markov reward models of several TCP versions on lossy links. The 
work in [?] is also based on a Markovian approach. The novelty in these works is the 
consideration of connections that switch on and off, following a two-state Markov model; 
however, the main performance figure is still the steady-state TCP connection throughput, 
since the model cannot provide estimates of completion times. Finally, [?] introduce 
the TCP modeling technique that is adopted in this paper. The models in [?] and 
[?] consider greedy connections; the models in [?] consider instead finite-size file 
transfers. While the models in [?] always assumed that just one TCP version 
is used over the IP network, the model in [?] considers the simultaneous presence of 
TCP-Tahoe and TCP-NewReno connections. 



3 TCP Sub-models 

The behavior of concurrent TCP flows is described with a network of M/G/∞ queues, 
where each queue describes a state of the TCP protocol, and each customer stands for 
an active TCP connection; thus, the number of customers at a queue represents the 
number of TCP connections that are in the corresponding state, and the times spent at 
queues represent the time spent by TCP connections in states. Actually, even if we use 
the queuing network notation, which is most familiar to telecommunications engineers, 
the TCP sub-model can also be viewed as a stochastic finite state machine (FSM), 
which can be derived from the FSM description of the protocol behavior. When greedy 
TCP connections are studied, a fixed finite number N of active TCP connections must 
be considered; correspondingly, in the TCP sub-model, N customers move within the 
queuing network, which is closed. Instead, in the case of short-lived TCP connections, 
the model must consider a number of active TCP connections that dynamically varies 
in time: a TCP connection is first opened, then it is used to transfer a file composed of a 
finite number of packets, and finally, after the file has been successfully transferred, the 
connection is closed. This scenario is modeled by an open queuing network: a customer 
arriving at the queuing network represents an opening connection; a customer leaving the 
queuing network represents the file transfer completion. Moreover, in order to properly 
model the evolution of a short-lived TCP connection, the finite size of the file to be 
transferred has to be taken into account. Thus, besides accounting for the protocol state, 
the description of the behavior of a short-lived TCP connection has to be enriched with a 
notion of the amount of data to be transferred. This is modeled by introducing customer 
classes: the class of a customer represents the number of packets that still have to be 
transferred. A TCP connection which must be used to transfer a file using c packets 
is represented by a customer arriving at the queuing network in class c. Whenever the 
customer visits a queue that corresponds to the successful transfer of a packet, the class 
is decreased by one. When the customer class reaches zero (meaning that the last packet 
has been successfully transferred), the customer leaves the queuing network. 
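
The class mechanism can be pictured with a short sketch. It is only an illustration of the bookkeeping, not part of the model: the function name and the sequence of per-visit delivery counts are hypothetical, whereas in the model the number of packets delivered at each visit is governed by the transition probabilities of the queuing network.

```python
def transfer_file(file_size, delivered_per_visit):
    """Follow one customer until its class (packets still to send) reaches zero.

    file_size: initial class c of the customer entering the open network.
    delivered_per_visit: packets successfully delivered at each queue visit
        (hypothetical input, chosen by hand for illustration).
    Returns the class held by the customer after each visit.
    """
    remaining = file_size
    history = []
    for delivered in delivered_per_visit:
        # A visit cannot deliver more packets than are still outstanding.
        remaining -= min(delivered, remaining)
        history.append(remaining)
        if remaining == 0:
            break  # the customer leaves the queuing network
    return history


# A 5-packet file; the third visit delivers nothing (e.g. a loss occurred).
print(transfer_file(5, [1, 2, 0, 2, 4]))  # [4, 2, 2, 0]
```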

In order to illustrate the flexibility of the proposed modeling approach in both dealing 
with different TCP versions and modeling the different types of TCP connections, we 
describe in some detail the case of greedy TCP-Tahoe connections (Section 3.1) and the 
case of short-lived TCP-NewReno connections (Section 3.2). 

3.1 Greedy TCP-Tahoe Connections 

We describe the queuing network model of greedy TCP-Tahoe connections, assuming 
that fixed-size TCP segments are transmitted over the underlying IP network. Each queue 
in the queuing network is characterized by the congestion window size (cwnd) expressed 
in number of segments. In addition, the queue describes whether the transmitter is in 
slow start, congestion avoidance, waiting for fast retransmit or timeout. Fig. 2 shows the 
queuing network when the maximum window size is W = 10 segments. Queues are 
arranged in a matrix pattern: all queues in the same row correspond to similar protocol 
states, and all queues in the same column correspond to equal window size. Queues 
below R0 model backoff timeouts, retransmissions, and the Karn algorithm. The queuing 
network comprises 11 different types of queues, briefly described below. 

Fig. 2. Closed queueing network model of TCP-Tahoe. 

Queues Ei (1 ≤ i ≤ W/2) model the exponential window growth during slow start; 
the index i indicates the congestion window size. 

Queues ETi (1 ≤ i ≤ W/2) model the TCP transmitter state after a loss occurred 
during slow start: the congestion window has not yet been reduced, but the transmitter 
is blocked because its window is full; the combination of window size and loss pattern 
forbids a fast retransmit (i.e., fewer than 3 duplicate ACKs are received), so that the 
TCP source is waiting for a timeout to expire. 

Queues EFi (4 ≤ i ≤ W/2) model a situation similar to that of queues ETi, where 
instead of waiting for the timeout expiration, the TCP source is waiting for the 
duplicate ACKs that trigger the fast retransmit. 

Queues Li (2 ≤ i ≤ W) model the linear growth during congestion avoidance (notice 
that queue L1 does not exist). 

Queues Fi (4 ≤ i ≤ W) model losses during congestion avoidance that trigger a fast 
retransmit. 

Queues TOi (2 ≤ i ≤ W) model the detection of losses by timeout during congestion 
avoidance. 

Queues Ti (1 ≤ i ≤ C) model the time lapse before the expiration of the (i+1)-th 
timeout for the same segment, i.e., the backoff timeouts. C is the maximum number 
of consecutive retransmissions allowed before closing the connection. Queue 
T_C actually models TCP connections that were closed for an excessive number of 
timeouts. In TCP-Tahoe this happens after 16 retransmissions, i.e., C = 16. Closed 
connections are supposed to re-open after a random time, which we chose equal to 
180 s; however, this value has a marginal impact on the results. 

Queues Ri (0 ≤ i ≤ C − 1) model the retransmission of a packet when the timeout 
expires. 

Queues EKi (1 ≤ i ≤ C − 1) model the first stage of the slow start phase (i.e., the 
transmission of the first 2 non-retransmitted packets) after a backoff timeout. During 
this phase the Karn algorithm ( fl311 and m Ch. 21.3) has a deep impact on the 
protocol performance under heavy load. 

Queues TK^2_i (1 ≤ i ≤ C − 1) model the wait for timeout expiration when losses 
occurred in queues EKi leaving the congestion window at 2. 

Queues TK^3_i (1 ≤ i ≤ C − 1) model the wait for timeout expiration when losses 
occurred in queues EKi leaving the congestion window at 3. 

For the specification of the queuing network model, it is necessary to define for each 
queue: the average service time, which models the average time spent by TCP connections 
in the state described by the queue; and the transition probabilities P(Qi, Qj), 
which are the probabilities that TCP connections enter the state described by queue Qj 
after leaving the state described by queue Qi. Depending on the queue type, the service 
time is a function of either the average round-trip time or the timeout. For example, the 
average time spent in queue Li (window size equal to i and congestion avoidance) is 
equal to the round-trip time; the average time spent in queue Ti ((i+1)-st timeout 
expiration) is equal to 2^i T0, where T0 is the initial timeout and the term 2^i accounts for 
back-offs. All other queue service times are set similarly. 
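
The two service-time examples above can be sketched in a few lines of Python. This is only an illustration: the function names and the numerical values of the round-trip time and of the initial timeout are assumptions, not values taken from the model.

```python
def congestion_avoidance_service_time(rtt):
    """Average time spent in a queue Li: one round-trip time (seconds)."""
    return rtt


def backoff_service_time(i, t0):
    """Average time spent in queue Ti: the initial timeout t0 doubled i times."""
    if i < 0:
        raise ValueError("the backoff index must be non-negative")
    return (2 ** i) * t0


# Hypothetical values: RTT of 100 ms, initial timeout of 1.5 s.
print(congestion_avoidance_service_time(0.1))
for i in range(1, 5):
    print(i, backoff_service_time(i, 1.5))
```

The exponential growth of the service time with i is what makes the queues below R0 weigh heavily on the average behavior under heavy load, even when they are visited rarely.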

The probabilities P(Qi, Qj) that customers completing their service at queue Qi 
move to queue Qj can be computed from the dynamics of TCP. These dynamics depend 
on the working conditions of the network, i.e., they depend on the round-trip time as 
well as the packet loss probability. In order to cope with the correlation among losses 
of packets within the same congestion window, we introduce two different packet loss 
probabilities: the probability of loss of the first segment of the active sliding window, 
P_Lf (where f stands for ‘first’), and the loss probability for any other segment in the 
same window, P_La (where a stands for ‘after’ the first segment loss). Notice that, due to 
loss correlation, P_La is much larger than P_Lf. Besides depending on P_Lf and P_La, the 
probabilities of moving from queue to queue are also a function of the distribution of the 
window growth threshold ssthresh. We denote by P_T(i) the probability that ssthresh is 
equal to i. For example, the probability that a customer moves from queue E1 to queue 
E2 equals the probability that the segment transmitted while in queue E1 is successfully 
delivered, i.e., 1 − P_Lf. Consider now queue Ei, which represents the transmission of 
two packets following an ACK reception in slow start with window size equal to i. After 
visiting queue Ei, a customer moves to queue E_{i+1} if both transmissions are successful 
and the threshold ssthresh is not reached; the probability of this event is equal to 



\[ P(E_i, E_{i+1}) = (1 - P_{Lf})^2 \sum_{k > i} P_T(k) \]



The first term accounts for the probability that both segments are successfully transmit- 
ted; while the second term is the probability that the threshold is larger than i. Instead, 
the customer moves from Ei to Li when the threshold is equal to i; the corresponding 
transition probability is 



\[ P(E_i, L_i) = \frac{P_T(i)}{\sum_{k \ge i} P_T(k)} \]



Similarly, all other transition probabilities can be derived. The distribution P_T(k) is 
determined based on the window size distribution (see lO). 

In order to compute the average load Λ offered to the network by TCP connections, 
we need to derive the load Λ_Q offered by connections in the state described by queue Q. 
If we denote with P_Q the number of packets offered to the underlying IP network by a 
connection in queue Q, the actual load offered by queue Q is Λ_Q = λ_Q P_Q, where λ_Q 
is the customer arrival rate at queue Q. The terms λ_Q are computed from the queuing 
network solution, assuming a total of N customers (connections); the terms P_Q are 
specified by observing the TCP behavior. For example, each connection in queue Ei 
(slow-start growth and window size equal to i) generates two segments; each connection 
in queue Li (congestion avoidance and window size equal to i) generates i segments. 
See jni for a more detailed description of the service times and transition probabilities. 
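
As a rough illustration of the load computation just described, the sketch below sums, over all queues, the customer arrival rate multiplied by the number of packets generated per visit. The queue labels and the arrival rates are hypothetical; in the model the arrival rates come from the solution of the closed queuing network.

```python
def offered_load(arrival_rates, packets_per_visit):
    """Total load (packets/s) offered to the network by the TCP sub-model.

    arrival_rates: queue label -> customer arrival rate at that queue (1/s).
    packets_per_visit: queue label -> packets generated by a connection
        each time it visits that queue.
    """
    return sum(rate * packets_per_visit[q] for q, rate in arrival_rates.items())


# Hypothetical arrival rates; packets per visit follow the two examples in
# the text: two segments in a slow-start queue Ei, i segments in a queue Li.
rates = {"E2": 10.0, "L4": 5.0}
packets = {"E2": 2, "L4": 4}
print(offered_load(rates, packets))
```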



3.2 Short-Lived TCP-NewReno Connections 

The main difference between the Tahoe and NewReno TCP versions lies in the presence 
of the fast recovery procedure m, which avoids the slow start phase when 
losses are detected via duplicate ACKs, entering congestion avoidance after halving 
the congestion window. The number of queues and queue types needed for the description 
of TCP-NewReno is approximately the same as for TCP-Tahoe, since the states of the 
protocol and the values that the window can assume are the same. The differences are mainly 
in the transition probabilities, which reflect the different loss recovery strategies of the 
two protocol versions. For instance, during the NewReno fast recovery procedure, the 
actual transmission window keeps growing, to allow the transmission of new packets 
while the lost ones are recovered. In addition, when considering short-lived instead 
of greedy connections, special care is needed in modeling the first slow-start growth, 
during which the threshold ssthresh has not been set yet. The impact of this period can be 
remarkable, especially for very short flows. In the development of the queuing network 
model of short-lived NewReno TCP connections, the following classes of queues are 
added to those shown in Fig. 2: 

Queues FEi (1 ≤ i ≤ W) model the exponential window growth during the first slow 
start phase after the connection is opened; the index i indicates the transmission 
window size. During the first slow start phase the window can grow up to W, and 
the growth is limited only by losses, since ssthresh is not yet assigned a value 
different from W. 

Queues ETi (W/2 ≤ i ≤ W) are similar to queues ETi (1 ≤ i ≤ W) but can be 
reached only from queues FEi, i.e., during the first slow start on the opening of the 
connection. 

Queues FEi (4 ≤ i ≤ W) model a situation similar to that of queues EFi (4 ≤ i ≤ 
W/2) but are entered from queues FEi in the first slow start phase. 

Queues ETi (1 ≤ i ≤ C) model the time lapse before the expiration of the i-th or 
(i+1)-st timeout in case the first segment is lost, when the RTT estimate is not yet 
available to the protocol and T0 is set to a default value, typically 12 tics. This event, 
which may seem highly unlikely, has indeed a deep impact on the TCP connection 
duration, especially when the connection is limited to a few or even a few tens of 
segments, which is one of the most interesting cases. 

Queues FRi (0 ≤ i ≤ C), FEKi (0 ≤ i ≤ C), FTK^2_i (0 ≤ i ≤ C), FTK^3_i (0 ≤ i ≤ C) 
are similar to queues Ri, EKi, TK^2_i and TK^3_i and refer to packet retransmission 
and timeout expiration for losses during the first slow start phase. 

Lack of space forbids a detailed description of this model, giving all the queue 
service times and transition probabilities. The interested reader can find all this 
information in Ell. 

When considering short-lived connections, a customer leaving a queue may also 
change class (remember that the class indicates the number of packets still to be 
transmitted before closing the connection). We denote by P(Qi, c; Qj, c − k) the probability 
that a customer leaving queue Qi in class c moves to queue Qj in class c − k, meaning 
that k packets were delivered in the protocol state represented by queue Qi. The value 
of k depends on the protocol state, on the packet loss probability, and on the window 
size. For example, after visiting a queue which represents the transmission of two 
consecutive packets in exponential window growth (queues FEi or Ei), k can be equal to 0, 
1, or 2, depending on the number of lost packets. Leaving queues Li (linear growth mode, 
window size i), k is equal to i if all packets in the window are successfully transmitted; 
k is smaller than i if some of the packets are lost. Again, due to space limitations, we 
cannot provide more details about transition probabilities, and refer readers to lO for 
further information. 

A TCP connection in a given queue Q and class c generates a number of packets 
equal to P_{Q,c}, which is the minimum between c and the number of packets whose 
transmission is allowed by the protocol state. The load offered by connections in queue 
Q is given by 

\[ \Lambda_Q = \sum_{c=1}^{N^{max}} \lambda_{Q,c} P_{Q,c} \]

where λ_{Q,c} is the arrival rate of customers in class c at queue Q, and N^max is the 
maximum considered file size. The average load offered to the network is then 

\[ \Lambda = \sum_{Q \in S} \Lambda_Q \]

where S is the set of all queues. The terms P_{Q,c} are computed similarly to the class 
decrement k discussed above. The difference is that P_{Q,c} includes all the packets 
that are submitted to the network, while k accounts for successfully transmitted 
packets only. 
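
The per-class load of a single queue can be sketched as follows. The function name, the example class arrival rates, and the choice of a queue that allows 4 packets per visit are assumptions made only for illustration.

```python
def queue_load(class_arrival_rates, allowed_packets):
    """Load (packets/s) offered by one queue of the open multiclass network.

    class_arrival_rates: class c (packets still to send) -> arrival rate of
        class-c customers at the queue (1/s).
    allowed_packets: packets the protocol state lets a connection transmit
        during one visit to the queue.
    """
    load = 0.0
    for c, rate in class_arrival_rates.items():
        # A connection never sends more packets than it still has to transfer.
        load += rate * min(c, allowed_packets)
    return load


# Hypothetical arrival rates at a queue that allows 4 packets per visit.
print(queue_load({1: 2.0, 3: 1.0, 10: 0.5}, allowed_packets=4))
```

Summing `queue_load` over all queues gives the total load offered to the network sub-model.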

4 Network Sub-models 

The network sub-model is an open network of queues, where each queue represents 
an output interface of an IP router, with its buffer. The routing of customers on this 
queuing network reflects the actual routing of packets within the IP network. In the 
description of the network sub-model, we assume that all router buffers exhibit a drop- 
tail behavior. However, active queue management (AQM) schemes, such as RED, can 
also be considered, and can actually lead to faster convergence of the FPA used in the 
solution of the complete model. 

Different approaches, with variable complexity, were tested to model queues at router 
output interfaces. When dealing with greedy TCP connections, the best tradeoff between 
complexity and accuracy was obtained by using a simple M/M/1/B_i queue to model 
each router output interface. This queuing model was observed to provide sufficiently 
accurate estimates of packet loss probability and average round-trip time with very limited 
cost; other approaches with significantly greater complexity provided only marginal 
improvements in the estimate accuracy. The Poisson packet arrival process at queue 
M/M/1/B_i has rate Λ_i, which is computed from the TCP sub-model(s) and the TCP 
connection routes. The average customer service time equals the time to transmit a TCP 
segment, i.e., 8·MSS/C_i s, where MSS is the Maximum Segment Size in bytes, and C_i 
is the data rate on the i-th channel in b/s. The average segment loss probability P_{L,i} can be 
derived from the M/M/1/B_i queue solution. From P_{L,i}, the loss probability of the first 
segment, P_{Lf,i}, and the loss probability of any other segment of the same window, P_{La,i}, 
can be derived. In 0 the authors assume that the intra-connection correlation among 
losses is 1, i.e., they assume that, after a loss, all the remaining packets in a congestion 
window are also lost, though the first lost segment is not necessarily the first segment 
in the window. In our case, considering that the TCP behavior is such that the first lost 
packet in a burst of losses is the one in the lowest position in a window (ACKs make the 
window slide until this is true), it seems more realistic to assume that the loss rate within 
a burst is obtained by scaling the initial loss rate: 

\[ P_{La,i} = \alpha_i P_{Lf,i}, \qquad 1 \le \alpha_i \le \frac{1}{P_{Lf,i}} \]

with the constraint that the average loss rate is respected. Empirical observations show 
that the loss correlation becomes stronger as the window size grows; to capture this 
behavior we set α_i = w_i, subject to the above constraint, where w_i is the average congestion 
window size computed by the TCP sub-model for flows crossing the i-th channel. The 
TCP connection average round-trip times are computed by adding propagation delays 
and average queueing delays suffered by packets at router interfaces. When dealing with 
short-lived TCP flows, the M/M/1/B_i queuing model fails to provide satisfactory 
results. This is due to the fact that in this case the traffic is much more bursty than for 
greedy TCP connections. In fact, the fraction of traffic due to connections in slow-start 
mode is much larger, and slow start produces more bursty traffic than congestion 
avoidance. We chose to model the increased traffic burstiness by means of batch arrivals, 
hence using M^[D]/M/1/B_i queues, where the batch size varies between 1 and W_i with 
distribution [D]. The variable-size batches model the burstiness of the TCP transmissions 
within the TCP connection round-trip time. The Markovian assumption for the 
batch arrival process is mainly due to the Poisson assumption for the TCP connection 
generation process (when dealing with short-lived connections), as well as the fairly 
large number of TCP connections present in the network. The average packet loss ratio, 
P_{L,i}, and the average time spent by packets in the router buffer are obtained directly 
from the solution of the M^[D]/M/1/B_i queue; the average round-trip times are then 
estimated by adding queueing delays and average two-way propagation delays of TCP 
connections. The key point of the modeling process in this case is the determination of 
the batch size distribution [D]. We compute the batch sizes starting from the number of 
segments Nm sent by a TCP transmitter during a round-trip time. Clearly, this is a gross 
approximation and a pessimistic assumption, since no two TCP segments are transmitted 
at the same time, and arrive at the same time at the router buffers. To balance this 
pessimistic approximation, the batch size is reduced by a factor μ < 1. Actually, since 
the TCP transmission burstiness is much higher during slow start than during congestion 
avoidance, we use two different factors: μ^e during the exponential window growth, and 
μ^l during the linear window growth. Unfortunately, it was not possible to find a direct 
method for the computation of these two factors, but a simple heuristic optimization led 
to the choice μ^e = 2/3, μ^l = 1/3, which yielded satisfactory results in all tested scenarios. 

Fig. 3. Abstract representation of the Internet as seen from hosts connected to the LAN of 
Politecnico di Torino; white circles represent TCP clients, grey circles TCP servers 
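
For the greedy-connection case, the loss probability and the average delay of a single-server queue with exponential service and finite capacity follow from its textbook stationary distribution. The sketch below is a minimal illustration under two stated assumptions: the parameter `capacity` counts all packets in the system (in transmission plus waiting), which may differ from the way the buffer size B_i is counted in the model, and the offered load of 0.9 in the usage lines is a hypothetical value. The batch-arrival queue used for short-lived connections is not covered here.

```python
def mm1b_metrics(arrival_rate, service_rate, capacity):
    """Loss probability and mean time in system of an M/M/1/B queue.

    arrival_rate: Poisson packet arrival rate (packets/s).
    service_rate: packet service rate, i.e. C / (8 * MSS) (packets/s).
    capacity: maximum number of packets in the system (assumption: this
        includes the packet being transmitted).
    """
    if arrival_rate <= 0 or service_rate <= 0 or capacity < 1:
        raise ValueError("rates must be positive and capacity at least 1")
    rho = arrival_rate / service_rate
    # Stationary probabilities are proportional to rho**n, n = 0..capacity.
    weights = [rho ** n for n in range(capacity + 1)]
    total = sum(weights)
    probs = [w / total for w in weights]
    loss = probs[-1]  # an arriving packet is dropped when the system is full
    mean_in_system = sum(n * p for n, p in enumerate(probs))
    # Little's law applied to the accepted traffic only.
    mean_delay = mean_in_system / (arrival_rate * (1.0 - loss))
    return loss, mean_delay


# Example: a 10 Mb/s channel carrying 1,024-byte segments.
service_rate = 10e6 / (8 * 1024)  # packets per second
loss, delay = mm1b_metrics(0.9 * service_rate, service_rate, capacity=128)
print(loss, delay)
```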

5 Numerical Results 

In this section we present some selected results, trying to highlight the flexibility of the 
modeling technique and the accuracy of the performance indices that can be evaluated 
using our methodology. Most results are validated against point estimates and confidence 
intervals obtained with ns-2 simulations [O- 

5.1 The Network Topology 

The networking setup we chose to assess the results obtained with the queueing network 
models of TCP closely resembles the actual path followed by Internet connections from 
our University LAN to web sites in Europe and the USA. A schematic view of the network 
topology is shown in Fig. 3: at the far left we can see a set of terminals connected to 
the internal LAN of Politecnico di Torino. These terminals are the clients of the TCP 
connections we are interested in (white circles in the figure represent TCP clients, grey 
circles represent TCP servers). The distance of these clients from the Politecnico router 
is assumed to be uniformly distributed between 1 and 10 km. The LAN of Politecnico 
is connected to the Italian IP network for universities and research institutions, named 
GARR-B (a portion of the European TEN-155 academic network), through a 10 Mb/s 
link whose length is roughly 50 km (this link will be called poli-to). Internally, the 
GARR-B/TEN-155 network comprises a number of routers and 155 Mb/s links. One of 
those connects the router in Torino with the router in Milano; its length is set to 100 km. 
Through the GARR-B/TEN-155 network, clients at Politecnico can access a number of 
servers, whose distance from Politecnico is assumed to be uniformly distributed between 
100 and 6,800 km. From Milano, a 45 Mb/s undersea channel whose length is about 
5,000 km reaches New York, and connects the GARR-B network to the North American 
Internet backbone (this link will be called mi-ny). Many other clients use the router in 
Milano to reach servers in the US. The distance of those clients from Milano is assumed 
to be uniformly distributed between 200 and 2,800 km. Finally, the distance of servers 
in the US from the router in NY is assumed to be uniformly distributed between 200 
and 3,800 km. For simplicity, we assume one-way TCP connections with uncongested 
backward path, so that ACKs are never lost or delayed. We consider three types of TCP 
connections: 

1. TCP connections from US servers to Politecnico di Torino clients; with our previous 
assumptions, the total length of these connections ranges from 5,351 to 8,960 km. 

2. TCP connections from GARR-B/TEN-155 servers to clients at Politecnico di Torino; 
connection lengths range between 151 and 6,860 km. 

3. TCP connections from US web sites to users connected to the GARR-B/TEN-155 
network through the mi-ny link; connection lengths range between 5,400 and 
11,600 km. 
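
The three length ranges can be checked by adding the minimum and maximum lengths of the segments each connection type crosses. The decomposition of each path into segments used below is an assumption inferred from the topology description; it is chosen because it reproduces the ranges quoted above. All constant and function names are illustrative.

```python
# Segment lengths in km as (min, max) pairs; fixed links have min == max.
CLIENT_TO_POLI = (1, 10)      # Politecnico clients to the Politecnico router
POLI_TO = (50, 50)            # 10 Mb/s access link
TORINO_MILANO = (100, 100)    # 155 Mb/s link inside GARR-B/TEN-155
MI_NY = (5000, 5000)          # 45 Mb/s undersea channel
GARR_SERVER = (100, 6800)     # GARR-B/TEN-155 servers, measured from Politecnico
MILANO_CLIENT = (200, 2800)   # other clients, measured from the Milano router
US_SERVER = (200, 3800)       # US servers, measured from the New York router


def path_range(*segments):
    """Return the (min, max) total length of a path made of the given segments."""
    return (sum(s[0] for s in segments), sum(s[1] for s in segments))


TYPE_1 = path_range(CLIENT_TO_POLI, POLI_TO, TORINO_MILANO, MI_NY, US_SERVER)
TYPE_2 = path_range(CLIENT_TO_POLI, POLI_TO, GARR_SERVER)
TYPE_3 = path_range(MILANO_CLIENT, MI_NY, US_SERVER)
print(TYPE_1, TYPE_2, TYPE_3)
```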

Given the link capacities, congestion occurs in either the 10 Mb/s poli-to channel 
(which is crossed by connections of types 1 and 2) or the 45 Mb/s mi-ny channel 
(which is crossed by connections of types 1 and 3), or both. The packet size is assumed 
constant, equal to 1,024 bytes; the maximum window size is 64 packets. We consider 
variable buffer sizes, and the default TCP tic value, equal to 500 ms. Connections that 
cross both links are those of type 1 and are assumed to be 25% of all the connections 
traversing the poli-to link (N1). 

When the poli-to channel is congested, and the mi-ny channel is lightly loaded, the 
scenario is called Local Access, since the bottleneck is only in the local access link. 
When instead the mi-ny channel is congested, and the poli-to channel is lightly loaded, 
the scenario is called USA Browsing, since the bottleneck is only in the transoceanic 
link. In both these cases, the topology practically reduces to a single bottleneck network. 
When both channels are congested, the topology is a two-bottleneck network. 

5.2 Results for Greedy TCP Connections 

In this section we discuss some of the results obtained for greedy connections with the 
closed queueing network model. Due to space limitations, we only consider connections 
traversing both bottlenecks, i.e., TCP connections of type (1) in the networking scenario 
discussed above, and we just show numerical results for loss probability and average 
congestion window size. A discussion of the results obtained in the case of single bottle- 
neck networks (Local Access or USA Browsing scenarios) can be found in jni, 
where we show that our model also derives very accurate estimates for the distribution 
of the TCP congestion window and slow start threshold. This proves that the model 
actually captures most of the internal dynamics of the TCP protocol, not only its average 
behavior. In CD we also show how the model can be used to compute the performance 
(e.g., the throughput) of one specific connection chosen from all those that interact over 
the IP network. 

Fig. 4 reports results for the packet loss probability of type (1) TCP connections. 
The top plot presents analytical and simulation results versus N1 and N2 (the numbers 
of connections on the poli-to and mi-ny channels, respectively); the bottom plot 
presents analytical and simulation results versus N1 for different values of N2. The 
average window size of the same connections is shown in Fig. 5. For both performance 
indices, the accuracy of analytical performance predictions is excellent. 

Fig. 4. Packet loss probability when connections cross both bottlenecks versus the numbers of 
connections N1 and N2 (top plot) and versus N1 with constant N2 (bottom plot) 

Fig. 5. Average window size when connections cross both bottlenecks versus the numbers of 
connections N1 and N2 

Fig. 6. Packet loss probability when connections cross both bottlenecks versus the numbers of 
connections N1 and N2; scenario with increased channel capacity 

One of the most attractive features of a useful analytical model is its ability to 
provide results for scenarios where simulation fails, due to excessive CPU or memory 
requirements. We found that running simulations in ns-2 with more than about 1,000 
TCP connections becomes very expensive with standard computers; this is a significant 
drawback, because channel speeds in IP networks are growing fast, and the number of 
concurrent TCP connections follows closely. We explored with the queuing network 
model a not-too-futuristic scenario where the poli-to link capacity is 100 Mb/s and 
the mi-ny link capacity is 1 Gb/s (other links are scaled accordingly); simulation results 
are obviously not available for this case. Fig. 6 reports the 3D plot of the packet loss 
probability in this scenario. The number of concurrent connections is up to 1,000 on 
the poli-to link, and up to 12,000 on the mi-ny link, which collects most of the traffic 
between the U.S. and Northern Italy. We can observe that the qualitative performance 
of the network does not change significantly by increasing the capacity and the number 
of connections. In particular, we found that for a given value of bandwidth per TCP 
connection, the loss ratio and the average TCP window size are almost constant. This 
observation gives strength to empirical models, such as those in H, that derive the 
TCP throughput as a function of the loss probability. In addition, it partially relieves 
the worries concerning the possibility of exploiting TCP over very fast WANs, often 
expressed by researchers, but never proved due to the impossibility of simulating such 
environments or to load test beds with present day applications. 

5.3 Results for Short-Lived TCP Connections 

We now discuss results obtained with the open multiclass queuing network model for 
short-lived NewReno connections in the case of single bottleneck topologies. We consider 
the cases of buffer sizes equal to either 128 or 64 packets, and TCP tic equal to 
500 ms. The amount of data to be transferred by each connection (i.e., the file size) is 
expressed in number of segments. We consider a traffic scenario where the file size has 
the following distribution: 50% of all TCP connections comprise 10 segments, 40% 
comprise 20 segments and 10% comprise 100 segments. Of course, the chosen file size 
distribution is just an example, and is far too simple to describe a realistic 
situation acceptably. Note however that, given any file size distribution, the queuing 
network model can be solved with limited complexity. 
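
As a small worked example, the mean file size implied by this distribution can be computed directly. The variable and function names are illustrative; the 1,024-byte packet size is the one stated in Section 5.1.

```python
# File size in segments -> percentage of connections with that size.
FILE_SIZE_PERCENT = {10: 50, 20: 40, 100: 10}
PACKET_BYTES = 1024  # constant packet size assumed in Section 5.1


def mean_file_size(dist_percent):
    """Mean file size in segments for a distribution given in percent."""
    if sum(dist_percent.values()) != 100:
        raise ValueError("percentages must add up to 100")
    return sum(size * pct for size, pct in dist_percent.items()) / 100


mean_segments = mean_file_size(FILE_SIZE_PERCENT)
print(mean_segments, mean_segments * PACKET_BYTES)
```

With these values the average transfer is 23 segments, so the rare 100-segment files account for a large share of the packets even though they are only 10% of the connections.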

Fig. 7 reports analytical and simulation results for the average packet loss probability 
in both the USA Browsing and Local Access scenarios. Results are plotted versus 
the normalized external load, which is the load that would result in the case of no packet 
loss, so that retransmissions are not necessary. The upper curve refers to the case of 64 
packet buffers, while the lower one refers to the case of 128 packet buffers. Since with our 
modeling approach the packet loss probability depends on neither the TCP connection 
round trip time, nor the link capacity (note that the external load is normalized), the 
analytical loss probability estimates are the same for both channels, and simulation 
results indeed confirm this prediction. 

When the file transfer latency is considered, results are obviously no longer independent 
of link capacities and round trip times. Fig. 8 reports results for buffer 128 only; 
the case with buffer 64 yields similar results, not reported for lack of space. The left 
plot refers to the USA Browsing scenario, while the right one refers to the Local 
Access scenario. Both plots report three curves, one for each file size. For light and 
medium loads, the TCP session latency is dominated by the transmission delay, since 
loss events are rare; hence longer sessions have longer transfer times and the latency 
increase is roughly linear. 

Fig. 7. Packet loss probability as a function of the external load for NewReno connections 

Fig. 8. Average session completion time as a function of the external load for the USA Browsing 
scenario (left plot) and for the Local Access scenario (right plot) in the case of 128 packets buffer 



5.4 Comparison of the Results Obtained for Greedy and Short-Lived TCP Connections 

So far, we have separately discussed the performance of greedy and short-lived TCP 
connections, but we have not compared the two. Traditionally, in the development of 
TCP models, short-lived connections were seen as “portions” of a greedy connection, 
disregarding the importance of transients. We discuss here the different behavior of short- 
lived and greedy TCP connections. We consider a single bottleneck network topology, 
that can be either the Local Access or the USA Browsing scenario, and the 
TCP-Tahoe version. In order to compare greedy and short-lived TCP connections, it is 
important to identify a “common operating condition” (COC for short), which cannot 
be the link utilization, because greedy connections force it to be close to one, provided 
that TCP connections are not window-limited. The two COCs we chose to consider are 
either the packet loss probability or the average number of active connections (note that 
fixing one of those does not imply that other system parameters, e.g. the link loads, are 
equal). Observe that the COC is in general the result of the interaction between the TCP 
sources and the network, not a fixed parameter given to the model or to the simulator. 
The comparison is very interesting from the point of view of IP network dimensioning 
and planning, since results show that the performance figures obtained assuming the 
presence of greedy or finite connections can be quite different, thus possibly leading to 
inaccurate planning results if file sizes are not considered. 

Fig. 9. Packet loss probability as a function of the average number of homogeneous connections 
in the USA Browsing scenario (top plot) and in the Local Access scenario (bottom 
plot) with 128 packet buffer 

We discuss two types of comparisons: the packet loss probability curves as a function
of the average number of connections, and the average throughput curves as a function of
the packet loss probability. Fig. 9 shows the packet loss probability curves for different
values of the file size: 10, 20, 40, 100, 200 packets, or infinite (greedy connections).
While the results for finite connections indicate that the loss probability increases with
the file size (this is due to an increase in the network load and in the traffic burstiness),
the position of the curve for greedy TCP connections is far from intuitive, crossing all
other curves as the number of connections increases. The differences in performance are
remarkable, but most important is the observation that greedy connections do not allow
a “worst case analysis.” Transfers of a few hundred kbytes of data (100-200 packets)
yield much worse performance than greedy connections, due to a higher burstiness in
the traffic generated by TCP during its initial transient that typically lasts several tens
of packets. Fig. 10 shows similar curves in the Local Access scenario with mixed-length
short-lived connections, using the file size distribution that was introduced before
(50% with length 10, 40% with length 20, and 10% with length 100). Given this file
size distribution, the performance results are driven by very short connections, and thus
greedy connections do indeed allow a worst case analysis.

Fig. 10. Packet loss probability as a function of the average number of Tahoe connections (mixing
different sizes) in both scenarios in the case of a 128-packet buffer

Even more interesting is the comparison of the throughput achieved by TCP connections
for a given average packet loss probability, as shown in Fig. 11. The difference
between greedy and finite connections is impressive, since the convexity of the curves 
is different in the two cases, suggesting significant differences in the dynamic behavior. 
For low packet loss probabilities, the throughput of finite connections is limited by the 
window size, while greedy connections can grow their window large enough to achieve 
the maximum possible throughput. Less intuitive is the behavior for medium packet 
loss probabilities, where connections comprising 100 or 200 packets obtain a through- 
put higher than that of greedy connections. This behavior is rooted in the initial TCP 
connection transient: during the first slow start phase, the window of finite connections 
increases geometrically to large sizes, much bigger than those achieved by greedy
connections. Larger windows, coupled with the transient dynamics, mean burstier traffic
and bursty, correlated losses. This results in many connections terminating without any 
loss, while the others experience only one or two loss events, maybe with several pack- 
ets lost, but typically triggering a single timeout. For high packet loss probabilities, the 
throughput becomes similar for all kinds of connections, since the window size is
always quite small.

Fig. 11. Average throughput as a function of packet loss probability of homogeneous connections
in the USA Browsing scenario (top plot) and in the Local Access scenario (bottom plot)

The consequence of these behaviors is that results for greedy TCP connections
should be used with great care: if the packet loss probability and the round trip time
of a TCP connection are known, and the existing closed-form formulas (like the square
root formula or its refinements) are used to estimate the connection throughput, the
result may be rather different from the throughput of a short-lived TCP connection,
and consequently the computed file transfer time may be inaccurate. Considering the
fact that real TCP connections are necessarily finite, and often very short, the greedy
source approximation should not be used, for instance, to predict the performance of
web browsing TCP connections.
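
To make the caveat concrete, the following is a minimal sketch of the naive estimate that the text warns against. It assumes the commonly quoted form of the square root formula, throughput ≈ C · MSS / (RTT · sqrt(p)) with C = sqrt(3/2); the function names and the numeric values are illustrative assumptions, not taken from the paper.

```python
import math


def sqrt_formula_throughput(mss_bytes, rtt_s, loss_prob, c=math.sqrt(1.5)):
    """Greedy-source throughput estimate in bit/s (square root formula).

    Assumption: C = sqrt(3/2), the constant usually quoted for the
    basic form of the formula; refinements use other constants.
    """
    return c * mss_bytes * 8 / (rtt_s * math.sqrt(loss_prob))


def naive_transfer_time(file_packets, mss_bytes, rtt_s, loss_prob):
    """File transfer time in seconds obtained by dividing the file size
    by the greedy-source throughput. This ignores the initial slow start
    transient, which is exactly why it can be inaccurate for short files.
    """
    rate = sqrt_formula_throughput(mss_bytes, rtt_s, loss_prob)
    return file_packets * mss_bytes * 8 / rate


if __name__ == "__main__":
    # Hypothetical operating point: 100-packet file, 1000-byte packets,
    # 100 ms round trip time, 1% packet loss probability.
    print(naive_transfer_time(100, 1000, 0.1, 0.01))
```

The estimate depends only on the round trip time and the loss probability, so it cannot reproduce the file-size dependence discussed above.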



6 Conclusions 

In this paper we described an analytical approach based on queuing networks to estimate
the performance of greedy and short-lived TCP connections in terms of round trip time,
loss probability and throughput, as well as the average completion times of short-lived
TCP flows. Models are built in a modular fashion, developing one or more ‘TCP sub-models’
and a ‘network sub-model,’ that are iteratively solved. TCP sub-models are closed
single-class queuing networks in the case of greedy TCP connections and open multi-class
queuing networks in the case of short-lived TCP flows. The model solution requires as
inputs only the primitive network parameters, and allows taking into consideration
different TCP versions and multi-bottleneck networks, obtaining solutions at very little cost.

We presented numerical results for some simple single and multi-bottleneck network
topologies, proving the accuracy of the analytical performance predictions, as compared 
to simulation point estimates and confidence intervals obtained with ns-2. In addition, we 
also compared numerical results for greedy and short-lived TCP connections, discussing 
the common practice of applying to short-lived TCP flows the performance predictions 
computed in the case of greedy TCP connections. The conclusion of this comparison is 
that greedy TCP connections can actually provide a worst case indication with respect to 
short-lived TCP flows, if the latter are rather short, but not in general. Indeed, when
short-lived TCP flows comprise a few hundred packets, they can suffer worse performance
than greedy TCP connections.

Further work in this area is still in progress, extending the modeling technique to 
other TCP versions, and trying to approximately account for losses of ACKs on the 
return path.



Abstract. This paper presents the architecture of a passive monitoring
system installed within the Sprint IP backbone network. This system dif- 
fers from other packet monitoring systems in that it collects packet-level 
traces from multiple links within the network and provides the capabil- 
ity to correlate the data using highly accurate GPS timestamps. After 
a thorough description of the monitoring systems, we demonstrate the 
system’s capabilities and the diversity of the results that can be obtained 
from the collected data. These results include workload characterization, 
packet size analysis, and packet delay incurred through a single backbone 
router. We conclude with lessons learned from the development of the 
monitoring infrastructure and present future research goals. 



1 Introduction 

Network traffic measurements provide essential data for networking research 
and operation. Collecting and analyzing such data from a tier 1 ISP backbone, 
however, is a challenging task. The traffic volume ranges from tens of Mb/sec 
on OC-3 access links to 10 Gb/sec on OC-192 backbone links. The measurement 
equipment must be installed in commercial network facilities where physical 
space and power are constrained, and which are, in some cases, not staffed by any 
human operators. Data analysis involves processing terabytes of data, and must 
account for unusual phenomena such as routing loops and malicious network 
users. 

This paper presents our experiences developing the Sprint IP Monitoring 
System, a traffic measurement system for the Sprint Internet IP network. The 
Sprint Internet network is a tier 1 IP backbone connecting 20 Points of Presence 
(POPs) in the United States. The monitoring system is designed to collect syn- 
chronized traces of traffic data from multiple links for use in in-depth research 
projects where aggregate statistics are insufficient. 

The Sprint IP Monitoring System consists of three basic components:

— A set of data collection systems (IPMON systems) which collect TCP/IP
headers of every packet transmitted on various links in the Sprint network, 
along with a measurement system that collects BGP and IS-IS routing in- 
formation. Currently 11 IPMON systems are deployed at one POP in the 
Sprint network. Two additional POPs, each with 10 systems, are scheduled 
to be installed. 

— A data repository which archives the traces collected by the IPMON systems. 

— A 16 node computing cluster used for data analysis. 

The design of the system focuses on four major issues: collecting packet traces 
from high speed links, synchronizing the traces, manipulating the large data 
sets, and administering the system. Collecting traces from high speed OC-3 (155 
Mb/sec) and OC-12 (622 Mb/sec) links has been addressed by several prior 
measurement systems. Our system follows the same general design, and we
discuss some of the challenges when extending it to OC-48 (2.48 Gb/sec) speeds. 
Synchronizing the clocks that generate the packet timestamps is accomplished 
using a stratum-1 GPS reference clock distributed to the IPMON systems. The 
data set for a single 24 hour trace collected on all of the IPMON systems is
1.1 TB. This data is compressed and transmitted from the IPMON systems to 
the data repository over a dedicated OC-3 link. The data repository and data 
analysis systems are interconnected using a gigabit Ethernet network. All system 
administration functions for the IPMON systems may be performed from the 
lab. In cases of extreme failures, the entire operating system may be reinstalled 
over the network. 

The remainder of the paper describes the details of how we address these 
design issues in the IP Monitoring System and presents some sample results 
that demonstrate the system’s capabilities. Section 2 discusses other work on IP
monitoring systems and compares them with our monitoring infrastructure. Section 3
describes the architecture of the network in which our monitoring systems
are installed. Section 4 presents the design requirements and details of the
monitoring system. Section 5 presents traffic measurements which demonstrate the
capabilities of our system and which evaluate the system performance. Section 6
concludes and discusses areas of future research.

2 Related Work 

There has been much work on active and passive network measurement systems. 
A measurement system is called active if it injects measurement traffic, such as 
probe packets, in the network. Passive measurement systems, on the other hand, 
do not inject any measurement traffic but rather observe the actual traffic flowing
in the network. 

Active measurement systems include NIMI, MINC, Surveyor, AMP, and
IEPM. The NIMI (National Internet Measurement Infrastructure) project
developed an architecture for deploying and managing scalable active
measurement systems. The NIMI system uses tools such as ping, traceroute, mtrace,
and treno to perform the actual measurements. MINC (Multicast-based
Inference of Network-internal Characteristics) measurement systems transmit
multicast probe messages to many destinations, and infer link loss rates, delay
distributions, and topology based on the observed correlations of the received
packets. Surveyor uses a set of approximately 50 GPS synchronized hosts
to measure one-way and round-trip delay over various Internet paths. The
AMP (NLANR Active Measurement Project) system consists of a set of
monitoring stations which measure the performance of the vBNS backbone. IEPM
(Internet End-to-end Performance Monitoring) monitors network performance
between high energy nuclear and particle physics research institutions. In
addition, companies such as Keynote and Matrix conduct commercial network
performance measurements.

Passive measurement systems include Simple Network Management Protocol
(SNMP)-based network traffic measurement tools, tcpdump, NetFlow, and
CoralReef. SNMP is the most widely used network management protocol in
today’s Internet. Agents and remote monitors update a management
information base (MIB) within network routers, and management stations retrieve
MIB information from the routers using UDP. Most routers support SNMP and
implement public MIBs as well as vendor-specific private MIBs. Using SNMP,
for example, network operators can keep track of the number of packets and
bytes that have arrived on an interface, the number of packets and bytes that
have been dropped on an interface, and the number of transmission errors that
have occurred on a link. Another common network monitoring tool is tcpdump. It
collects packets transmitted and received by systems running the Unix operating
system. NetFlow is a monitoring system available on Cisco routers which
collects flow statistics observed by an interface. NetFlow provides more
detailed information than is available through SNMP, but it requires an external
system to record the NetFlow data. The CoralReef suite, developed by CAIDA
and originally based on the OC3MON developed at MCI, collects timestamped
packet traces from various ATM and SONET links. This system is very
similar to our monitoring system, but it does not have GPS synchronization.

Other efforts have been made in routing and standardization of metrics. The
Internet Performance Measurement and Analysis (IPMA) project investigates
routing behavior and network failures. The IETF IP Performance Metrics
(IPPM) working group standardizes metrics for evaluating network performance
based on observations realized within the projects described above. To our
knowledge, the only project to address network-wide traffic analysis in a
comprehensive manner has been developed at AT&T. This project relies on
packet-level information collected by packet sniffers called PacketScopes, flow 
statistics collected using Cisco’s NetFlow tools, and routing information. It also 
includes active components which collect loss, delay, and connectivity statistics. 
Results from active and passive components are combined to be used in network 
monitoring and management of the AT&T backbone. 

Our project is unique in that it allows trace collection at different points in a
commercial backbone network, and it provides the capability to correlate these
traces through highly accurate timestamps. Many other traffic measurement
systems are installed in either research or access networks. Some results
from commercial IP backbones are available, but they do not have the
GPS synchronization capabilities of our system.



3 Backbone Network Description 




Fig. 1. Tier 1 IP backbone network 



A backbone IP network provides connectivity over a geographically wide area.
The network topology consists of a set of nodes known as Points-of-Presence
(POPs) connected by high bandwidth backbone links. These links are typically
2.5 Gb/sec OC-48 links or 10 Gb/sec OC-192 links. Each POP also contains
links to customers (e.g. large corporate networks, regional ISPs providing
dial-up access, large web servers), ranging from 1.5 Mb/sec T1 links to 622 Mb/sec
OC-12 links. Figure 1 shows the architecture of a backbone network.

A backbone network connects to other backbone networks at private peer- 
ing points or public network access points (NAPs). Two networks can peer at 
multiple points to accommodate the traffic volume between them as shown in
Figure 1. The peering points are intended to carry traffic that originates from
a customer connected to one backbone ISP and is destined to a customer of 
another backbone ISP. Most peering agreements prohibit transit traffic, or traf- 
fic whose source and destination are not customers of the backbone ISP. For 
example, Backbone ISP #2 would not accept traffic from Backbone ISP #1 if
the destination was not one of Backbone ISP #2’s customers. 

Fig. 2. POP architecture

In the case of the Sprint Internet backbone, there are 20 POPs located in the
continental United States. The POPs in the Sprint Internet backbone have a
two-level hierarchical structure as shown in Figure 2. At the lower level,
customer links are connected to access aggregation routers. The access routers are
in turn connected to the higher level backbone routers. The backbone routers
provide connectivity to other POPs, and they also connect to public and private
peering points. The backbone links connecting the POPs are optical fibers with
a bandwidth of 2.5 Gb/sec (OC-48) or 10 Gb/sec (OC-192). All of the links
carry IP traffic using a proprietary version of the Packet-over-SONET (POS)
protocol similar to that proposed in

4 System Description 

The goal of the Sprint IP Monitoring System is to collect data from the Sprint 
Internet backbone that is needed to support a variety of research projects. The 
particular research projects include studying the behavior of TCP, evaluating 
network delay performance, investigating the nature of denial of service attacks, 
and developing network engineering tools. While each project could develop a
customized measurement system, many of the projects require similar types of
data, and installing monitoring equipment in a commercial network is a complex
task, which makes a single general purpose measurement system preferable.

To meet this goal, the IP Monitoring system is designed to collect and an- 
alyze synchronized packet level traces from selected links in the Sprint Internet 
backbone. These packet traces consist of the first 44 bytes of every packet car- 
ried on the links along with a 64 bit timestamp. The clocks which generate the 
timestamps are synchronized to within 5 μs using a GPS reference clock. This
provides the capability to measure one-way network delays and study correla- 
tions in traffic patterns. 

Trace collection and analysis is accomplished using three separate systems. 
A set of data collection components (IPMON systems) collect the packet traces. 
The traces are transferred to a data repository which stores the traces until they 
are needed for analysis. Analysis is performed on a cluster of 16 Linux servers. 
The remainder of this section describes the requirements and architecture of 
each of these components. 

4.1 IPMON Systems 

The monitoring systems, called IPMON systems, are responsible for collecting
packet traces from the network. Each system consists of a Linux PC with a large
disk array and a SONET network interface, known as the DAG card. To
collect the traces, an optical splitter is installed on selected OC-3, OC-12, and
OC-48 links in the network, and one output of the splitter is connected to the
DAG card in the IPMON system.



Table 1. Packet Record Format

bytes     field description
0-7       8 byte timestamp
8-11      record size (fixed to 64 bytes)
12-15     POS frame length
16-19     HDLC header
20-63     First 44 bytes of IP packet



The DAG card decodes the SONET payloads and extracts the IP packets. 
When the beginning of a packet is identified, the DAG card generates a times- 
tamp for the packet, extracts the first 48 bytes of the POS frame which contains 
4 bytes of POS header and 44 bytes of IP data, and transfers the packet record 
to main memory in the PC using DMA. The format of the packet record is
shown in Table 1. If the packet contains fewer than 44 bytes, the data is padded
with all 0’s. Once 1 MB of data has been copied to main memory, the DAG
card generates an interrupt which triggers an application to copy the data from 
main memory to the hard disk. It would be possible to transfer the data from 
the DAG card directly to the hard disk, and bypass the main memory. Main 
memory, however, is necessary to buffer bursts of traffic as described later in the 
section. 
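
As an illustration of the record layout in Table 1, the sketch below unpacks one 64 byte packet record. The field widths follow the byte offsets in the table; the big-endian byte order, the function name, and the dictionary keys are assumptions made for this example, since the text does not specify them.

```python
import struct

RECORD_SIZE = 64

# Layout per Table 1: 8 byte timestamp, 4 byte record size,
# 4 byte POS frame length, 4 byte HDLC header, 44 bytes of IP data.
# Assumption: big-endian ("network") byte order for the integer fields.
RECORD_STRUCT = struct.Struct(">QIII44s")


def parse_record(buf):
    """Split one 64 byte packet record into its fields."""
    if len(buf) != RECORD_SIZE:
        raise ValueError("expected a 64 byte record, got %d bytes" % len(buf))
    timestamp, record_size, pos_len, hdlc, ip_bytes = RECORD_STRUCT.unpack(buf)
    return {
        "timestamp": timestamp,          # raw 64 bit clock value
        "record_size": record_size,      # fixed to 64 in the trace format
        "pos_frame_length": pos_len,
        "hdlc_header": hdlc,
        "ip_bytes": ip_bytes,            # zero padded if the packet is short
    }


if __name__ == "__main__":
    # Build a synthetic record; struct pads the 44 byte field with zeros.
    raw = RECORD_STRUCT.pack(123456789, 64, 40, 0xFF030021, b"\x45" * 40)
    print(parse_record(raw)["pos_frame_length"])
```

A trace file would then be read in 64 byte steps, which is one practical benefit of the fixed record size.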

The IPMON system has 5 basic design requirements: 

— Support data rates ranging from 155 Mb/sec (OC-3) to 2.5 Gb/sec (OC-48)

— Provide synchronized timestamps 

— Occupy a minimal amount of physical space 

— Prevent unauthorized access to trace data 

— Be capable of remote administration 

Next we describe how each of these requirements is met in the system design.



Data Rate Requirements. The data rate requirements for OC-3, OC-12, and
OC-48 links are summarized in Table 2. The first line of the table shows the
data rate at which the DAG card must be able to process incoming packets.
After the DAG card has received a packet and extracted the first 44 bytes, the
timestamp and additional header information are added to the packet record, which
is then copied to main memory. If there is a sequence of consecutive packets whose size
is less than 64 bytes, then the amount of data that is stored to main memory is
actually greater than the line rate of the monitored link. The amount of internal
bandwidth required to copy a sequence of records corresponding to minimum size
TCP packets (40 byte packets) from the DAG card to main memory is shown
on the second line of Table 2. To support this data rate, the OC-3 and OC-12
IPMON systems use a standard 32 bit, 33 MHz PCI bus which has a capacity
of 1056 Mb/sec (132 MB/sec). The OC-48 system, however, requires a 64 bit,
66 MHz PCI bus with a capacity of 4224 Mb/sec (528 MB/sec). It is possible to
have non-TCP packets which are smaller than 40 bytes, resulting in even higher
bandwidth requirements, but the system is not designed to handle extended
bursts of these packets as they do not occur very frequently. It is assumed that
the small buffers located on the DAG card can handle short bursts of packets
less than 40 bytes in size. The impact of this design decision is evaluated in
Section 5.
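
The peak capture rate follows from simple arithmetic: every 40 byte packet on the link produces a 64 byte record, so the internal rate is the link rate scaled by 64/40. The sketch below reproduces this calculation together with the nominal PCI bus capacity (bus width times clock rate); the function names are illustrative assumptions.

```python
RECORD_BYTES = 64          # size of one packet record
MIN_TCP_PACKET_BYTES = 40  # minimum size TCP packet assumed in the text


def peak_capture_rate_mbps(link_mbps):
    """Internal bandwidth needed when the link carries only 40 byte packets."""
    return link_mbps * RECORD_BYTES / MIN_TCP_PACKET_BYTES


def pci_capacity_mbps(width_bits, clock_mhz):
    """Nominal PCI bus capacity: one bus-width transfer per clock cycle."""
    return width_bits * clock_mhz


if __name__ == "__main__":
    for name, rate in (("OC-3", 155), ("OC-12", 622), ("OC-48", 2480)):
        print(name, peak_capture_rate_mbps(rate))
    print(pci_capacity_mbps(32, 33), pci_capacity_mbps(64, 66))
```

For the nominal OC-12 rate of 622 Mb/sec this gives about 995 Mb/sec, close to the 992 Mb/sec listed in Table 2; the OC-3 and OC-48 values match the table exactly.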



Table 2. Data rate requirements

                              OC-3    OC-12    OC-48
link rate (Mb/sec)             155      622     2480
peak capture rate (Mb/sec)     248      992     3968
1 hour trace size (GB)          11       42      176



Once the data has been stored in main memory, the system must be able to 
copy the data from memory to disk. The bandwidth required for this operation, 
however, is significantly lower than the amount of bandwidth needed to copy the 
data from the DAG card to main memory as the main memory buffers bursts 
of small packets before storing them to disk. Only 64 bytes of information are 
recorded for each packet that is observed on the link. As reported in prior studies, 
the average packet size observed on backbone links ranges from about 300 to 400
bytes during the busy periods of the day. For our design, we assume an
average packet size of 400 bytes. The disk I/O bandwidth requirements are 
therefore only 16% of the actual link rate. For OC-3 this is 24.8 Mb/sec; for 
OC-12, 99.5 Mb/sec; and for OC-48, 396.8 Mb/sec. To support these data rates, 
we use a three-disk RAID array for the OC-3 and OC-12 systems which has 
an I/O capacity of 240 Mb/sec (30 MB/sec). The RAID array uses a software 
RAID controller available with Linux. To support OC-48 we use a five-disk RAID 
array with higher performance disks that can support 400 Mb/sec (50 MB/sec) 
transfers. To minimize interference with the data being transferred from the 
DAG card to memory, the disk controllers use a separate 32 bit 33 MHz PCI 
bus. 
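
The 16% figure is the ratio of the 64 byte record to the assumed 400 byte average packet. A minimal sketch of the resulting disk bandwidth requirement, checked against the RAID capacities quoted above, is given here; the function name and the dictionary are illustrative assumptions.

```python
RECORD_BYTES = 64
AVG_PACKET_BYTES = 400  # average packet size assumed in the design


def disk_rate_mbps(link_mbps, avg_packet_bytes=AVG_PACKET_BYTES):
    """Sustained disk write rate for a fully utilized link."""
    return link_mbps * RECORD_BYTES / avg_packet_bytes


# RAID write capacities quoted in the text, in Mb/sec.
RAID_CAPACITY_MBPS = {"OC-3": 240, "OC-12": 240, "OC-48": 400}
LINK_RATE_MBPS = {"OC-3": 155, "OC-12": 622, "OC-48": 2480}

if __name__ == "__main__":
    for name, rate in LINK_RATE_MBPS.items():
        need = disk_rate_mbps(rate)
        print(name, round(need, 1), need <= RAID_CAPACITY_MBPS[name])
```

Note that the OC-48 requirement sits just below the 400 Mb/sec capacity of the five-disk array, so a smaller average packet size would exceed it.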



Timestamp Requirements. In order to correlate the traces, the packet times- 
tamps generated by each IPMON system need to be synchronized to a global 
clock signal. This is accomplished using a dedicated clock on board the DAG
card. The clock runs at a rate of 16 MHz which provides a granularity of 59.6 ns
between clock ticks.

Unfortunately, the oscillators used to drive the clock run just a little bit faster 
or a little bit slower than 16 MHz based on system temperature and the quality 
of the oscillator. Therefore, it is necessary to fine tune, or discipline, the clocks 
using an external stratum 1 GPS receiver located at the POP sites. The GPS 
receiver outputs a 1 pulse-per-second (PPS) signal which is distributed to all of 
the DAG cards located at the POP. 

The clock synchronization on board the DAG card operates in the following
manner. At the beginning of trace collection the clock is loaded with the
absolute time from the PC’s system clock (e.g. 7:00 am Aug 9, 2000 PST). The
clock then begins to increment at a rate of 16 MHz. When the DAG card receives
the first 1 PPS signal after initialization, it resets the lower 24 bits of the clock
counter (note: 24 bits count from 0 to about 16 million, so when the lower 24 bits
of the clock are all 0 this represents the beginning of a second). Thereafter, each
time the DAG card receives the 1 PPS signal, it compares the lower 24 bits of the clock
to 0. If the value is greater than 0, the oscillator is running a little bit fast and
the DAG card decreases the frequency slightly. If the value is less than 0, the
oscillator is running a little bit slow and the DAG card increases the frequency
slightly.
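
The comparison step can be sketched as follows. This is an illustration under stated assumptions, not the DAG card's actual firmware: the lower 24 bits are interpreted as a signed offset around the second boundary (so that a value just below the wrap point counts as "less than 0"), and the fixed correction step `step_hz` is a hypothetical parameter.

```python
TICKS_PER_SECOND = 1 << 24          # a 24 bit counter wraps once per second
LOW_MASK = TICKS_PER_SECOND - 1
TICK_NS = 1e9 / TICKS_PER_SECOND    # about 59.6 ns per tick


def pps_offset_ticks(counter):
    """Signed distance of the lower 24 bits from the second boundary.

    Assumption: values in the upper half of the range are treated as
    negative, i.e. the counter has not yet reached the boundary.
    """
    low = counter & LOW_MASK
    return low if low < TICKS_PER_SECOND // 2 else low - TICKS_PER_SECOND


def discipline(counter, freq_hz, step_hz=1.0):
    """Return the adjusted oscillator frequency after one 1 PPS pulse."""
    offset = pps_offset_ticks(counter)
    if offset > 0:        # clock ran fast: slow the oscillator down
        return freq_hz - step_hz
    if offset < 0:        # clock ran slow: speed the oscillator up
        return freq_hz + step_hz
    return freq_hz        # exactly on the second boundary


if __name__ == "__main__":
    print(round(TICK_NS, 1))
    print(pps_offset_ticks((3 << 24) + 5), pps_offset_ticks((3 << 24) - 1))
```

With this interpretation the 59.6 ns granularity corresponds to a tick rate of 2^24 per second, which is consistent with a 24 bit counter wrapping exactly once per second.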

In addition to synchronizing the DAG clocks, the IPMON systems must 
also synchronize their own internal clocks so that the DAG clock is correctly 
initialized. This is accomplished using NTP. A broadcast NTP server is installed 
on the LAN which is connected to the monitoring systems and is capable of 
synchronizing the system clocks in the PC to within 200 ms. This is sufficient
to synchronize the beginning of the traces, and the 1 PPS signal is used to 
further synchronize the DAG clock. There is an initial period where the 1 PPS 
is attempting to correct the initial clock skew, so we ignore the first 30 seconds 
of each trace to account for this. 

There are several sources of error that may occur in the synchronization 
system. The first is clock skew between the 1 PPS signals generated by different 
GPS receivers located at different POPs. This error is minimal as we are using 
stratum 1 GPS receivers which are guaranteed to have a maximum clock skew 
of 500 ns. Another source of error is the difference in propagation time for the 
1 PPS signal. The 1 PPS signal is distributed to the DAG cards using a daisy 
chain topology. The difference in cable length between the first and last systems 
is 8 meters, which corresponds to a propagation delay of 28 ns. Finally, the 
clock synchronization mechanism cannot immediately adjust to changes in the
oscillator frequency; it needs to wait for the next 1 PPS signal. To test this
aspect of performance, we measure the maximum clock error that is observed
when the DAG card receives a 1 PPS interrupt. The maximum value we have
seen in lab tests is 30 clock ticks which represents an error of 1.79 μs. The median
error observed during these tests was 1 clock tick, or 59.6 ns. Adding all of these
factors, the worst case skew between any two DAG clocks is less than 2 μs.

Another source of error is that packets are not timestamped immediately
when they arrive at the DAG card. They must first pass through a chip which
implements the SONET framing. As this chip was initially designed to operate
on 53 byte ATM cells, it is possible for two packets (one 40 byte packet and
the first 13 bytes of the next packet) to be placed in a cell buffer at the same
time. Since this buffer is read as an entire unit, both packets will have identical
timestamps. However, their actual inter-arrival time is 2 μs if we are measuring
an OC-3 link. This results in an additional 2 μs of timestamp error. This is only
an issue for the OC-3 and OC-12 DAG cards. The OC-48 systems use a newer 
SONET framing chip which was designed to support POS directly and does not 
use 53 byte buffers. 

The total effect of these errors is a maximum of 5 μs of clock skew between
DAG cards. However, we are interested in measuring the delay experienced by
packets as they traverse the network. This delay is typically measured on the
order of milliseconds, so the 5 μs skew is acceptable. The only case where the
clock skew affects the measurements is when we are interested in measuring
the delay through a single router in the network. The minimum delay we have
observed is 30 μs, so a 5 μs skew represents a 16% error in the measurement.
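
The individual figures in this error discussion can be checked with a few lines of arithmetic. The sketch below converts clock ticks to microseconds, computes the time a 40 byte packet occupies an OC-3 link, and expresses the skew relative to the smallest observed router delay; the helper names are assumptions introduced for the example.

```python
TICK_NS = 1e9 / 2 ** 24  # 59.6 ns tick, assuming a 2^24 ticks/sec counter


def ticks_to_us(ticks):
    """Convert a number of clock ticks to microseconds."""
    return ticks * TICK_NS / 1000.0


def serialization_time_us(packet_bytes, link_mbps):
    """Time needed to transmit a packet at the given link rate.

    Bits divided by Mb/sec yields microseconds directly.
    """
    return packet_bytes * 8 / link_mbps


def relative_error(skew_us, delay_us):
    """Clock skew as a fraction of the measured delay."""
    return skew_us / delay_us


if __name__ == "__main__":
    print(round(ticks_to_us(30), 2))                # 30 tick maximum error
    print(round(serialization_time_us(40, 155), 2)) # 40 byte packet on OC-3
    print(round(relative_error(5, 30), 3))          # 5 us skew, 30 us delay
```

The last value is slightly under 17%, in line with the roughly 16% quoted in the text.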



Physical Requirements. In addition to supporting the bandwidth require- 
ments of OC-3, OC-12, and OC-48 links, the IPMON systems must also have a 
large amount of hard disk storage capacity to record the traces. As the systems 
are installed in a commercial network facility where physical space is a scarce 
resource, this disk space must be contained in a small form factor. Using a
rack-optimized system, the OC-3 and OC-12 IPMONs are able to handle 108 GB of
storage in only 4U of rack space. This allows the system to record data for 9.8
hours on a fully utilized OC-3 link or 2.6 hours on a fully utilized OC-12 link. 
The OC-48 systems have a storage capacity of 360 GB, but in a slightly larger 
7U form factor. This is sufficient to collect a 2 hour trace on a fully utilized link. 
Fortunately, the average link utilization on most links is less than 50%, allowing 
for longer trace collection. 

The physical size constraint is one of the major limitations of the IPMON 
system. Collecting packet level traces requires significant amounts of hardware. 
These traces are critical for conducting research activities, but trace collection 
is not a scalable solution for operational monitoring of an entire network. The 
ideal solution is to use traces collected by the IPMON system to study the traf- 
fic and develop more efficient monitoring systems targeted towards exhaustive 
monitoring of all links in the network for operational purposes. 



1U is a standard measure of rack space and is equal to 1.75 inches or 4.45 cm.



Security Requirements. The IPMON systems collect proprietary data about
the traffic on the Sprint Internet backbone. Preventing unauthorized access to
this trace data is an important design requirement. This includes preventing
access to trace data stored on the systems as well as preventing access to the
systems in order to collect new data. To accomplish this, the systems are
configured to accept network traffic only from two applications: ssh and NTP. ssh is
an authenticated and encrypted communication program similar to telnet that
provides access to a command line interface to the system. This command line
interface is the only way to access trace data that has been collected by the
system and to schedule new trace collections. ssh only accepts connections from
a server in our lab and it uses an RSA key based system to authenticate users.
All data that is transmitted over the ssh connection is encrypted.

The second type of network traffic accepted by the IPMON systems is NTP 
traffic. The systems only accept NTP messages which are transmitted as broad- 
cast messages on a local network used exclusively by the IPMON systems. All 
broadcast messages which do not originate on this network are filtered. 



Remote Administration Requirements. In addition to being secure, the 
IPMON systems must also be robust against failures since they are installed, 
in some cases, where there is no human presence. To detect failures, a server 
in the lab periodically sends query messages to the IPMON systems. The re- 
sponse indicates the status of the DAG cards and of the NTP synchronization. 
If the response indicates that either of these components has failed, the server 
attempts to restart the component through an ssh tunnel. If the server is not able 
to correct the problem, it notifies the system administrator that manual intervention 
is required. In some cases, even the ssh connection will fail, and the systems 
cannot be accessed over the network. To handle this type of failure, the systems 
are configured with a remote administration card that provides the capability 
to reboot the machine remotely. The remote administration card also provides 
remote access to the system console during boot time. In cases of extreme fail- 
ure, the system administrator can boot from a write-protected floppy installed 
in the systems and completely reinstall the operating system remotely. 
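The escalation described above (query, automatic restart, then notification) can be sketched as a single polling round. This is a minimal illustration, not the actual lab server code: the component names, the `restart` and `notify` callables, and the status dictionary are all hypothetical stand-ins for the real query and ssh mechanisms.

```python
def check_and_recover(status, restart, notify):
    """Run one polling round for a single monitoring system.

    status  -- dict mapping component name to True (healthy) or False
    restart -- callable(component) -> bool, True if the restart succeeded
    notify  -- callable(message), alerts the system administrator
    Returns the components that still need manual intervention.
    """
    unresolved = []
    for component in ("dag", "ntp"):
        if status.get(component, False):
            continue  # component reported healthy
        if not restart(component):
            unresolved.append(component)
    if unresolved:
        notify("manual intervention required: " + ", ".join(unresolved))
    return unresolved
```

Passing the restart and notification actions in as callables keeps the decision logic separate from the transport, so the same round can be exercised without a network.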

The one event that cannot be handled remotely is hardware failure. The 
monitoring systems play no role in the operation of the backbone, and thus we 
decided not to provide hardware redundancy to handle failures. 



4.2 Data Repository 

The data repository is a large tape library responsible for archiving the trace 
data. Once a set of traces has been collected on the IPMON systems, the trace 
data is transferred over a dedicated OC-3 connection from the IPMONs to the 
data repository. 

A single 24-hour-long trace from all of the monitoring systems currently 
installed consumes approximately 1.2 TB of disk space (this will increase to 
3.3 TB when the additional 20 systems are installed). The tape library has 10 
individual tape drives which are able to write data at an aggregate rate of over 
100 MB/sec. The rate at which the data can be transferred from the remote 
systems, however, is limited to 100 Mb/sec which is the capacity of the network 
interface cards on the IPMON systems. At this rate the raw data would take 
26.4 hours to transfer from the IPMON systems to the tape library. To improve 
transfer time and decrease the storage capacity requirements, the trace data is 
compressed before being transferred back to the lab. Using standard compression 
tools such as gzip, we are able to achieve compression ratios ranging from 2:1 to 
3:1 depending on the particular trace characteristics. This reduces the transfer 
time to about 12 hours. 
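The transfer times above can be checked with a short calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a link that sustains its full 100 Mb/sec; the function name is illustrative.

```python
def transfer_hours(trace_bytes, link_bps=100e6, compression_ratio=1.0):
    """Hours needed to move trace_bytes over link_bps after compression."""
    return trace_bytes / compression_ratio * 8 / link_bps / 3600

raw = transfer_hours(1.2e12)                          # uncompressed
best = transfer_hours(1.2e12, compression_ratio=3.0)  # 3:1 compression
worst = transfer_hours(1.2e12, compression_ratio=2.0) # 2:1 compression
print(f"raw {raw:.1f} h, compressed {best:.1f} to {worst:.1f} h")
```

With a nominal 1.2 TB the raw figure comes out slightly above the 26.4 hours quoted, which suggests the actual trace volume was a little under 1.2 TB; the quoted compressed transfer time of about 12 hours falls inside the range obtained for 2:1 to 3:1 compression.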

This transfer time presents another difficulty when exhaustively monitoring a 
network for operational purposes. An alternative solution would be to avoid the 
data repository and perform the analysis on the monitoring systems themselves. 
This is a good solution if there is a single type of analysis that is being performed 
on the traces. However, the data is used for many research projects, and some of 
the analysis performed in several projects requires multiple iterations through 
the trace. In addition, we would like to keep an archive of the collected data so 
that it may be used for future projects. 

4.3 Analysis Platform 

All data analysis is performed off-line by a cluster of 16 Linux PCs. Some of the 
data analysis, such as measuring packet size distributions, could be performed 
on-line but others, such as measuring network delay, cannot. Measuring network 
delay requires identifying a packet on multiple traces. This involves exchanging 
significant amounts of data between two (or more) IPMON systems for every 
packet whose delay is measured. This could be accomplished on a local network 
if both systems are located in the same POP, but it would significantly increase 
the network load when processing data collected at multiple POPs. Since we 
do not want to perturb the network during trace collection, it is necessary to 
perform the analysis off-line. 

There are two categories of analysis that are performed by the analysis plat- 
form: 

— Single trace analysis involves processing data from a single link to measure 
traffic characteristics. This type of analysis includes, for example, determin- 
ing packet size distributions, flow size distributions, and determining what 
types of applications are using different links. To efficiently perform this type 
of analysis, traces from different links are loaded onto separate PCs in the 
cluster and processed in parallel. 

— Multi-trace analysis involves correlating traffic measurements among differ- 
ent links. This includes performing delay measurements and looking at round 
trip TCP behavior. This type of analysis is performed by dividing each trace 
into several time segments and loading the different time segments onto different 
machines. For example, PC #1 might contain the first 30 minutes 
from a set of 5 traces, and PC #2 might contain the next 30 minutes of 
those same 5 traces. 
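The time-segment partitioning used for multi-trace analysis can be sketched as follows. This is an assumption-laden illustration: the data layout (a dict of per-link packet lists) and the round-robin assignment of segments to machines are hypothetical, and a real implementation would need some overlap at segment boundaries so that a packet seen just before a boundary on one link and just after it on another can still be matched.

```python
from collections import defaultdict

def partition_by_time(traces, start, segment_sec=1800, n_machines=16):
    """Assign each packet to a machine according to its time segment.

    traces -- dict: link name -> list of (timestamp_sec, record)
    Returns dict: machine index -> dict: link name -> list of packets.
    """
    assignment = defaultdict(lambda: defaultdict(list))
    for name, packets in traces.items():
        for ts, record in packets:
            segment = int((ts - start) // segment_sec)
            assignment[segment % n_machines][name].append((ts, record))
    return assignment
```

Every machine then holds the same time window from all links, which is what correlating packets across links requires.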

One key requirement when performing multi-trace analysis is to be able to iden- 
tify an individual packet as it travels across multiple links in the network. The 
only two pieces of information that should change as a packet travels through the 
network are the TTL and checksum fields in the IP header. By comparing the 
remaining 41 bytes of data we collect for each packet, we can identify a packet 
at multiple locations in the network. However, it is possible for two different 
packets to have the same 41 bytes. In theory this should happen infrequently 
since the ID field for each packet generated by a particular source should be 
unique. However, in the traces we collect we do observe duplicate packets, due 
to systems generating incorrect IP ID fields or to link layer retransmissions, 
but these packets represent only 0.001% to 0.1% of the total traffic volume. In 
these cases, we typically ignore all packets which have duplicate values. 
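The matching step can be sketched as follows. The sketch assumes that the 44 captured bytes start at the first byte of an IPv4 header, so that the TTL sits at offset 8 and the header checksum at offsets 10 and 11; masking those three bytes leaves the 41 invariant bytes mentioned above. Function names and the `(timestamp, record)` trace layout are illustrative.

```python
from collections import Counter

MUTABLE_OFFSETS = frozenset({8, 10, 11})  # IPv4 TTL and header checksum bytes
CAPTURED_BYTES = 44

def packet_key(record):
    """Return the 41 bytes that should not change from hop to hop."""
    head = record[:CAPTURED_BYTES]
    return bytes(b for i, b in enumerate(head) if i not in MUTABLE_OFFSETS)

def unique_index(trace):
    """Map key -> timestamp, keeping only keys seen exactly once in the trace."""
    counts = Counter(packet_key(rec) for _, rec in trace)
    return {packet_key(rec): ts for ts, rec in trace
            if counts[packet_key(rec)] == 1}

def single_hop_delays(trace_in, trace_out):
    """Return sorted (arrival_time, delay) pairs for packets found in both traces."""
    seen_in = unique_index(trace_in)
    seen_out = unique_index(trace_out)
    return sorted((t_in, seen_out[key] - t_in)
                  for key, t_in in seen_in.items() if key in seen_out)
```

Packets whose key occurs more than once in either trace are dropped, mirroring the policy of ignoring duplicates.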



5 Measurement Results 

In this section we present a sample of measurement results to demonstrate the 
types of data that can be collected using the IPMON measurement facilities. 
The presented data are not intended to make any generalizable statement on 
the nature of the traffic on an IP backbone. The intent is to validate our moni- 
toring infrastructure and to demonstrate its capabilities using a few simple trace 
analyses. 

Fig. 3. Measurement Configuration 



For brevity, we present data from only two of the nine bi-directional moni- 
tored links. Both links are connected to the same core router in one POP within 
the Sprint IP backbone network. The trace web-out was collected on a link from 
the backbone router to an access router connected to a web hosting company, 
and web-in is the link from the access router to the backbone router. The peer- 
out trace was collected on a link from the backbone router to a peering point, 
and peer-in is the link from the peering point to the backbone router. Both the 
peering link and the web hosting link are OC-3 links. Figure 3 shows a diagram 
of the monitored links. It is important to note that there are other, unmonitored, 
links which are also connected to the backbone router. We do not exhaustively 
monitor all links in the POP. Table 3 provides the trace start times, trace end 
times, and trace sizes. 

Table 3. Trace Statistics 

Link      Start Time (PST)      End Time (PST)          Number of Packets (millions)
web-out   9:56, Wed, 8/9/2000   19:57, Wed, 8/9/2000    568
web-in    9:56, Wed, 8/9/2000   23:18, Wed, 8/9/2000    852
peer-out  9:56, Wed, 8/9/2000   9:55, Thurs, 8/10/2000  816
peer-in   9:56, Wed, 8/9/2000   9:55, Thurs, 8/10/2000  794



5.1 Workload Characterization 

First we present the general characteristics of the traces. Figures 4 and 5 plot 
link utilization averaged over one minute periods. Figure 6 shows the application 
mix on the web-out trace. From these figures we make several observations: 



Fig. 4. Peering link utilization in Mb/sec 

Fig. 5. Web link utilization in Mb/sec 



— As reported in many other studies, the dominant source of traffic is http. We 
only present the results from the web-out link, but the results are similar on 
the other links. 

— The link utilization on the peering link changes dramatically (from nearly 
100 Mb/sec to under 10 Mb/sec) over a 24 hour period. 

— Link utilization is not symmetric for either the peering link or the web host- 
ing link. While this is expected for the web hosting traffic, as the web servers 
should generate much more data than they receive, the traffic was expected 
to be more symmetric on the peering point. All of the links we monitor ex- 
hibit such asymmetric characteristics. It may be possible to take advantage 
of this fact when allocating disk space for the traces if a single system is 
used to monitor both directions of a link. 

— Link utilization is typically under 50% with peaks reaching just over 60%. 
The data we present represents two of the most heavily utilized monitored 
links. The drop in link utilization around midnight corresponds to a main- 
tenance period. 

Fig. 6. web-out traffic breakdown by application 



Another point to note is that the web link trace reflects the limited scalability 
of our monitoring system. The reason why the trace in Figure 5 is truncated 
around midnight is that there was insufficient disk space. In an effort to increase 
the number of monitored links, we configured one IPMON system to monitor 
both directions of the web link, and tried to optimize the disk allocation to take 
advantage of the traffic asymmetries. However, this was not successful. The disk 
space on a single IPMON system is not sufficient to capture a full 24 hour trace 
for both of these links. Later traces collected on the web link used one IPMON 
system for each direction. 



Table 4. Trace statistics 

trace     TCP          UDP          Other       min      average  max      IP         IP       TCP
          packets      packets      packets     packet   packet   packet   fragments  options  options
                                                size     size     size
                                                (bytes)  (bytes)  (bytes)
web-out   530 million  33 million   4 million   20       339      1500     57,549     2068     67 million
web-in    806 million  34 million   13 million  20       540      1500     420,269    1548     66 million
peer-out  740 million  106 million  7 million   20       590      1500     292,450    828      79 million
peer-in   635 million  143 million  14 million  20       315      63,945   164,924    2915     67 million

Other traffic characteristics are summarized in Table 4. This table shows 
the number of TCP/UDP/other packets; the minimum, average, and maximum 
packet sizes; the number of packets which were IP fragments; and the number 
of packets with IP and TCP options. The number of packets with IP and TCP 
options affects the amount of information that is provided by the trace data. If 
a packet contains either type of option, then the size of the TCP/IP header can 
exceed the 44 bytes of data we collect. If this is the case, we may lose information 
about the TCP port numbers, sequence numbers, or flags. As can be seen in the 
table, the number of packets which contain IP options is less than 0.0004% of 
the total traffic volume. The number of packets with TCP options, on the other 
hand, can be up to 12% of the total traffic. However, the only part of the header 
which we do not capture on these packets is the TCP options. We are able to 
record the source/destination port, the TCP sequence numbers, and flags. 

5.2 Packet Size Analysis 



Fig. 7. Packet size distribution on peering link 

Fig. 8. Packet size distribution on web link 



The packet size characteristics of a link impact two system design parameters: 

— the duration of the trace that may be collected 

— the rate at which the IPMON systems must process incoming packets 

The IPMON systems record 64 bytes for each packet. Therefore, the duration of 
the trace is limited to a particular number of packets. If two links are running at 
the same link utilization, the system monitoring the link with the higher average 
packet size will be able to record a longer trace. 

The packet size distribution we observe follows the same tri-modal distribution 
as observed in other studies. The cumulative distribution function 
of the packet size distribution is shown in Figures 7 and 8. The packet size 
distribution has peaks at 40 bytes (minimum size TCP packets), 1500 bytes (maximum 
size Ethernet packets), and at 552 and 576 bytes (maximum size TCP packets 
from TCP implementations which do not perform MTU discovery). The minimum, 
average, and maximum packet sizes for each link are shown in Table 4. The 
packet size distribution on the web links explains the discrepancy 
in trace durations. Each trace was configured to use an amount of disk space 
proportional to the link utilization (i.e. the web-in link was configured with 
approximately twice the capacity of the web-out link). The goal was to collect the 
same duration of trace from each link. However, as can be seen from the packet 
size distributions, the web-out link (which is the output link from the network to 
the web server customer) carries a large number of small packets containing web 
requests and acknowledgements, while the web-in link carries a large number of 
maximum size packets containing web data. Therefore, when determining the 
amount of disk space required to collect the traces, the traffic volume in terms 
of packets per second is a better measure than the traffic volume in terms of bits 
per second. 

Fig. 9. Peer link traffic volume in packets/sec 

Fig. 10. Web link traffic volume in packets/sec 



The traffic volume in terms of packets/sec is shown in Figures 9 and 10. 
These values are computed over one minute averages. The figures demonstrate 
that the data rate in packets per second on the web-in link is nearly the same as 
the data rate on the web-out link, rather than twice the data rate as predicted 
by the bits per second data rate. 

The figures also indicate how the packet size characteristics affect the data 
rate requirements of the IPMON systems. Figure 9 shows several peaks in the 
traffic volume in packets/sec on the peering point links. These peaks, however, 
do not correspond to equivalent peaks in overall traffic volume in bytes (i.e. there 
is no equivalent peak observed in Figure 4). The peaks correspond to bursts of 
small packets that occur in the network. In this case, the peaks actually repre- 
sent a large number of SYN packets which are all transmitted to one particular 
destination, a common denial-of-service attack. 

Regardless of the source of the traffic, a sequential arrival of small packets 
imposes a performance burden on the IPMON system. To evaluate this burden 
we count the number of packets of similar sizes which arrive sequentially (e.g. the 
number of 40 byte packets that arrive back-to-back) and plot the distribution in 
Figure 11. We categorize packets into three classes: small, medium, and large. 
Small packets are less than 500 bytes, medium packets are between 500 and 
1000 bytes, and large packets are longer than 1000 bytes. Sequential arrivals of 
medium and large packets are similar on both peer-in and peer-out links. For 
small packets, the number of sequential arrivals can be quite large, and in the 
case of peer-out, even reach 599. 
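Counting sequential arrivals amounts to a run-length computation over the sequence of packet size classes. The following is a minimal sketch using the class boundaries given above; the function names and the choice to return one histogram per class are illustrative.

```python
from collections import Counter
from itertools import groupby

def size_class(length):
    """Classify a packet length in bytes as small, medium, or large."""
    if length < 500:
        return "small"
    if length <= 1000:
        return "medium"
    return "large"

def run_length_distribution(packet_lengths):
    """Return, per size class, a Counter mapping run length -> number of runs."""
    dist = {"small": Counter(), "medium": Counter(), "large": Counter()}
    for cls, run in groupby(packet_lengths, key=size_class):
        dist[cls][sum(1 for _ in run)] += 1
    return dist
```

For example, `max(run_length_distribution(lengths)["small"])` gives the longest back-to-back run of small packets in a trace.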





Fig. 11. Sequential packet arrivals on peer link 

Fig. 12. Peak rate for peering link 



The number of sequential arrivals, however, does not tell how close in time the 
packets arrive. To capture the temporal aspects of packet bursts, we examine the 
peak arrival rate in packets per second on the monitored links, shown in Figure 
12. This figure shows the peak traffic rate (in packets/sec) at time scales ranging 
from 10 ms to 1 sec. The data is generated by computing the average arrival 
rate over 10 ms intervals for the entire trace. Using this data we compute the 
peak arrival rate observed over any of the 10 ms intervals. We then evaluate the 
peak rate at a range of time intervals ranging from 10 ms to 1 sec. Even at the 
smallest time scale, 10 ms, the peak arrival rate is only 231,000 packets/sec, 
well within the 1.94 million packets/sec supported by the IPMON systems. 
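One way to compute such peak rates is to bin arrivals into 10 ms intervals and slide a window of k bins over the counts. The text does not say whether the larger time scales use sliding or disjoint windows, so the sliding window here is an assumption, as are the integer microsecond timestamps and the function name.

```python
def peak_rates(timestamps_us, base_us=10_000, scales=(1, 10, 100)):
    """Peak arrival rate (packets/sec) over windows of scale * base_us."""
    t0 = min(timestamps_us)
    n_bins = (max(timestamps_us) - t0) // base_us + 1
    counts = [0] * n_bins
    for t in timestamps_us:
        counts[(t - t0) // base_us] += 1

    # prefix[i] holds the number of packets in bins 0 .. i-1
    prefix = [0]
    for c in counts:
        prefix.append(prefix[-1] + c)

    peaks = {}
    for k in scales:
        width = min(k, n_bins)  # a trace shorter than the window uses all bins
        best = max(prefix[i + width] - prefix[i]
                   for i in range(n_bins - width + 1))
        peaks[k * base_us] = best * 1_000_000 / (k * base_us)
    return peaks
```

The result maps each window length in microseconds to the highest average rate seen over any window of that length; shorter windows can only report equal or higher peaks.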

The peak rate determines the data rate required to copy traffic data from the 
DAG cards to main memory in the system. The packet sizes also affect the data 
rate required to store the data to disk. The systems were originally designed to 
support traffic with an average packet size of 400 bytes, which does correspond 
to the average packet size across all the links. However, from our measurements, 
the average packet size on a single link can be closer to 300 bytes. The disks on 
the OC-3 and OC-12 monitors have enough bandwidth to support the smaller 
average packet size, but the OC-48 monitors can only support up to 1.9 Gb/sec 
of traffic if the average packet size is 300 bytes. Fortunately, the OC-48 links we 
plan to monitor are not run at full link utilization. While there may be bursts of 
traffic which increase link utilization to 100%, Figure 12 indicates that these types 
of bursts only occur at small time scales on OC-3 links, and similar behavior 
is expected on the OC-48 links. The OC-48 systems are configured with a 512 
MB memory buffer which can buffer eight seconds of data, and should be able 
to accommodate bursts which may occur. 
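The 1.9 Gb/sec limit and the eight second buffer can be reproduced with a back-of-the-envelope calculation. The sketch assumes a nominal OC-48 rate of 2488.32 Mb/sec, the 64-byte per-packet record, and, as an inference not stated in the text, that the disks sustain exactly the record rate of the 400-byte design point. All names are illustrative.

```python
RECORD_BYTES = 64
OC48_BPS = 2488.32e6  # nominal line rate in bits/sec (assumption)

def record_rate(link_bps, avg_packet_bytes):
    """Bytes/sec of trace records produced by traffic at link_bps."""
    return link_bps / 8 / avg_packet_bytes * RECORD_BYTES

def max_link_bps(disk_bytes_per_sec, avg_packet_bytes):
    """Highest traffic rate whose records the disks can absorb."""
    return disk_bytes_per_sec / RECORD_BYTES * avg_packet_bytes * 8

disk = record_rate(OC48_BPS, 400)       # assumed disk bandwidth (design point)
sustainable = max_link_bps(disk, 300)   # with 300-byte average packets
hold = 512 * 2**20 / record_rate(OC48_BPS, 300)  # seconds of records in 512 MB
print(f"{sustainable / 1e9:.2f} Gb/sec sustainable, buffer holds {hold:.1f} s")
```

Under these assumptions the sustainable rate comes out near 1.9 Gb/sec and the buffer holds about eight seconds of records at full line rate, consistent with the figures quoted above.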

5.3 Delay Measurements 

One of the unique aspects of the IPMON system is its capability to measure 
network delays for actual network traffic. Most current delay measurements are 
performed using a set of probe packets which are transmitted at periodic or 
random time intervals. While these systems are able to provide a general idea 
about the delay performance of the probe packets, it is difficult to determine 
if the performance of the probe packets represents an accurate sampling of the 
delays experienced by actual network traffic. 




Fig. 13. Delay from web-in to peer-out 



The IPMON systems allow us to measure the delays seen by every packet 
that is transmitted between two points in the network. This is accomplished by 
identifying the same packet in two different traces and computing the difference 
in the timestamps. Figure 13 shows a two minute sample data set. The x-axis 
shows the packet arrival time, and the y-axis shows the delay experienced by 
the packet for traffic between the web-in and peer-out links. Since both links 
are located on the same core router, this data represents the single hop delay 
experienced by packets in the backbone. We currently have systems installed 
only in one location in the network, but additional systems are in the process of 
being installed. With data from these systems, we will be able to measure delays 
across many hops in the backbone. 

There are several points to note from this figure. First, the minimum delay 
across the entire interval remains almost constant, around 30 µs. This is the 
smallest delay interval that it is possible to measure in the network, so the 5 
µs error that may be introduced by the clock synchronization mechanism is 
acceptable. Second, there is a rather large increase in delay at time 30 seconds. 
The delay experienced increases to nearly 30 ms, which is unusually large for a 
single hop delay. This type of delay is also observed at a small number of points 
elsewhere in the trace. These excessive delays are the types of data that are 
difficult to observe using probe traffic. The long delays are experienced only by 
a small number of packets, but the impact on these packets is extremely large. 
The source of these long delays is currently under investigation, but it is believed 
to be due to a pathologic behavior of the router we are observing rather than 
actual queuing delays. 

6 Conclusion 

We describe the Sprint IP Monitoring system, a passive monitoring system that 
collects packet-level traces from the Sprint Internet IP backbone. The systems 
are capable of supporting OC-3, OC-12, OC-48 data rates, and are synchronized 
to within 5 µs using a stratum-1 GPS reference clock. We present the system 
design and demonstrate the performance of the system with several sample mea- 
surements. 

The advantage of our system is it provides the capability to collect traces 
from multiple locations in the network and correlate the traces through highly 
accurate timestamps. This provides the capability to study both single link char- 
acteristics (e.g. workload and packet size distributions), as well as characteristics 
which require data from multiple links (e.g. delay, TCP behavior, network 
provisioning). It is also very flexible in that the data is not targeted towards a single 
use. The packet traces are useful in many diverse research projects. 

The disadvantage of the system is that the amount of data collected is very 
large. Data from a single 24 hour period exceeds 3.3 TB. This requires both a 
large amount of resources to be installed in network facilities for data collection 
purposes and a large amount of resources to perform data analysis. While our 
system supports monitoring 31 different links and can be extended to monitor 
several dozen additional links, scaling the system to exhaustively monitor the 
entire network is impractical. 

In future work we plan to analyze in depth the traffic observed on various 
links on the network. These results will be used to: 

— Design provisioning and dimensioning tools to better anticipate customers' 
needs and increase customer satisfaction, eventually making it possible to 
provide various classes of service. 

— Gain a better understanding of the traffic characteristics on an IP backbone, 
and design more accurate traffic models. 

— Work with router designers to design embedded measurement facilities. 


Abstract. In this paper, we explore the use of end-to-end unicast traffic 
measurements to estimate the delay characteristics of internal network 
links. Experiments consist of back-to-back packets sent from a sender to 
pairs of receivers. Building on recent work fl ll.'il4] . we develop efficient 
techniques for estimating the link delay distribution. Moreover, we also 
provide a method to directly estimate the link delay variance, which can 
be extended to the estimation of higher order cumulants. Accuracy of 
the proposed techniques depends on strong correlation between the delay 
seen by the two packets along the shared path. We verify the degree of 
correlation in packet pairs through network measurements. We also use 
simulation to explore the performance of the estimator in practice and 
observe good accuracy of the inference techniques. 



1 Introduction 

Background and Motivation. As the Internet grows in size and complexity, it be- 
comes increasingly important for users and providers to characterize and measure 
its performance and to detect and isolate problems. Yet, because of the sheer size 
of the network and the limits imposed by administrative diversity, it is generally 
possible to directly access and measure only a small portion of the network. 
Consequently, there is a growing need for practical and efficient procedures that 
can take an internal snapshot of a significant portion of the network. 

A promising approach to network measurements, the so called Network To- 
mography approach, addresses these problems by exploiting the end-to-end traf- 
fic behavior to reconstruct the network internal performance. The idea is that 
correlation in performance seen on intersecting end-to-end paths can be used 
to draw inferences about the performance characteristics of their common portion, 
without cooperation from the network. Multicast traffic is particularly well 
suited for this, since a given packet only occurs once per link in the multicast 
distribution tree. Thus multicast traffic introduces a well structured correlation 
in the end-to-end behavior observed by the receivers that share the same multicast 
session. This correlation allows one to infer performance characteristics such as 
packet loss rates, packet delay distributions, and packet delay variance. 

* This work was supported in part by DARPA and the AFL under agreement F30602- 

(c) Springer-Verlag Berlin Heidelberg 2001 




Fig. 1. 2-Leaf Tree. 

To illustrate the idea behind multicast based delay inference, consider the 
simple tree in Figure 1 with the source (the root node) sending multicast packets 
to the two leaf nodes L and R and assume we collect the end-to-end measure- 
ments at the two receivers. If we consider the events where the delay seen by L 
is zero (assume for simplicity that the transmission and propagation delay are 
zero), the corresponding additional delays seen at R can be attributed to the 
link from C to R alone. We can thus form an estimate of the delay distribution 
for the link from C to R. The delay distribution of the other links can be derived 
by similar arguments. 
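The conditioning argument can be sketched as an empirical estimator. The sketch assumes delays are already discretized into integer bins with transmission and propagation delay removed, and that link delays are independent; the function name and the `(delay_at_L, delay_at_R)` input layout are illustrative and not taken from the paper.

```python
from collections import Counter

def link_delay_distribution(pairs):
    """Estimate the delay distribution of link C->R from multicast probes.

    pairs -- iterable of (delay_at_L, delay_at_R), one entry per probe.
    Uses only probes whose delay at L is zero, for which the delay seen
    at R is attributed entirely to the link from C to R.
    """
    conditioned = [y_r for y_l, y_r in pairs if y_l == 0]
    if not conditioned:
        raise ValueError("no probes with zero delay at L")
    counts = Counter(conditioned)
    n = len(conditioned)
    return {d: c / n for d, c in sorted(counts.items())}
```

Only the probes with zero delay at L contribute, so the estimate becomes noisy when such events are rare.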

Despite the encouraging results, multicast measurements suffer from two seri- 
ous limitations. First, large portions of the Internet do not support network-level 
multicast. Second, the internal performance observed by multicast packets often 
differs significantly from that observed by unicast packets. This is especially se- 
rious given that unicast traffic constitutes the largest portion of the traffic on 
the Internet. 

To overcome the limitations of multicast measurements, methods to extend 
the inference techniques to unicast measurements have recently been proposed 
for the inference of loss rates and for delay distributions. The key idea 
is to design unicast measurements whose correlation properties closely resemble 
those of multicast traffic, so that it is possible to use the inference techniques de- 
veloped for multicast inference; the closer the correlation properties are to that 
of multicast traffic, the more accurate the results. The basic approach, which 
has been further refined in [Zj for the estimation of the loss rates, is to dispatch 
two back-to-back packets (a packet pair) from a probe source to a pair of dis- 
tinct receivers. The premise is that, when the duration of network congestion 
events exceeds the temporal width of the packets, packets experience very similar 
behavior when they traverse common portions of their paths. Differences in 
the packets' behavior occur because congestion events may not affect packets 
uniformly: packet loss may not be uniform if lossy periods last less than the 
time between the arrival of the two packets; delays will differ because of the 
interleaving of background traffic. Still, if the packets experience very similar 
behavior, the error in using the multicast based estimator is very small. 

As an example, consider again the tree in Figure 1 with the source now 
sending two packets, back-to-back, the first to L and the second to R. For 
the events where the delay seen by L is zero, we will still attribute 
the additional delays seen at R to the link from C to R. But, because the two 
packets will possibly experience slightly different delays along the link from 0 to 
C, our estimate of the delay distribution for the link from C to R will contain 
an error roughly equal to the difference in delay seen by the two packets along 
the common link. The smaller this difference, the more accurate the estimates. 

We observe that a more accurate approach would take into account 
the difference experienced by the two packets along the shared link and 
incorporate it in our model. Unfortunately, we found that it is not possible 
to estimate its value, at least not without additional assumptions. Therefore, 
here we rely on small deviations from the ideal behavior and proceed as if the two 
packets experience the same delay along the shared path. 



Contributions. In this paper we describe efficient techniques for the estimation 
of link delay characteristics, namely, the per link delay distribution and per link 
delay cumulants, via end-to-end packet pairs measurements. 

For the distribution analysis, our starting point is the work by Lo Presti 
et al. and subsequent work by Coates and Nowak. Following that work, 
we model link delay by non-parametric discrete distributions. The discrete 
distribution can be regarded as a binned or discretized version of the (possibly 
continuous) true delay distribution, where we explicitly trade off the detail of 
the distribution against the cost of calculation. A potential limitation of this 
approach lies in the accuracy/complexity trade-off itself. The complexity of 
the analysis is a function of the number of bins. Under the usual 
discrete model, whereby delay is discretized using a fixed bin size q, a q small 
enough to ensure a desired level of accuracy in the estimates results in too many 
parameters (bins) and excessive computational costs. 

To overcome these limitations, we describe here a novel approach to delay 
modeling. The idea is to discretize delay using variable sized bins. Smaller bins 
are used only where probability mass is concentrated, to ensure 
adequate resolution, while larger bins are used elsewhere. Intuitively, this allows 
us to reduce the number of parameters (bins), and hence the complexity, significantly, 
without losing accuracy. A complication with this approach is that a discrete 
model with arbitrary variable bin sizes does not lend itself to analysis. We therefore 
propose an approach to variable bin size modeling which, while restricting the 
possible choices of bin size to a specific format, lends itself to analysis. In 
particular, we can formulate the estimation problem for the proposed variable bin size 
model by generalizing the Maximum Likelihood formulation of earlier work. Estimation 
is carried out by adapting the Expectation-Maximization (EM) algorithm used 
in prior work to compute the maximum likelihood estimates. 

Then, we also describe an efficient method to directly infer the per link delay 
variance. By a simple argument, we show that it is possible to express the link 
delay variance in terms of the covariance of the end-to-end delays. Therefore, we 
can estimate the variance directly from the sample covariance of the end-to-end 
delays. The same method can be extended to the estimation of higher order 
cumulants. Distribution and cumulants are closely related: knowledge of (all) 
the cumulants of a random variable is equivalent to knowing its distribution. 

The rest of the paper is organized as follows. In Section 2 we specify the
tree and delay model. In Section 3 we describe the estimators of the delay
distribution. In Section 4 we describe the link delay variance estimator (for lack
of space, we omit the extension to higher order cumulants). In Section 5 we use the
National Internet Measurement Infrastructure (NIMI) to gather end-to-end
data from a diverse set of Internet paths, and verify the conditions for the
accuracy of our methods. In Section 6 we use network level simulation to evaluate
the accuracy of the estimators. We conclude in Section 7.

Related Work. There exist several tools and methodologies for characterizing
link-level behavior from end-to-end unicast measurements. One of the first
methodologies focuses on identifying the bottleneck bandwidth on a unicast
route. The key idea is that, in an uncongested network, two packets sent
back-to-back will arrive at the receiver with a spacing that is inversely
proportional to the lowest link bandwidth on the path. This was noted by
Jacobson [?] and analyzed by Keshav [?].

The use of end-to-end measurements of packet pairs in a tree connecting a single
sender to several receivers for estimating the link delay was first
considered in [?]. The inference of the link delay distribution is formulated as
a maximum likelihood estimation problem, which is solved using the
Expectation-Maximization (EM) algorithm. In [?] the authors extend this approach
to the nonstationary case, and in [?] they investigate unicast-based inference in
the context of passive monitoring, whereby inference is based on observation of
ongoing unicast sessions. The preliminary results reported in these papers
show promise.

Our approach extends the results in [?] in that we consider a more general
form of discrete model, which allows us to significantly improve the
accuracy/complexity trade-off. We remark that the variable bin size scheme
presented in this paper can be used in other settings, e.g., multicast-based
inference techniques.



2 The Tree and Delay Models 

Tree Model. We represent the underlying physical network as a graph
$G_{phys} = (V_{phys}, L_{phys})$ comprising the physical nodes $V_{phys}$ (e.g.
routers and switches) and the links $L_{phys}$ between them. We consider a single
source of probes $0 \in V_{phys}$ and a set of receivers $R \subset V_{phys}$. We
assume that the set of paths from 0 to each $r \in R$ is stationary and forms a
tree $T_{phys}$ in $(V_{phys}, L_{phys})$; thus two such paths never intersect
again once they have diverged. We form the logical source tree $T = (V, L)$ whose
vertices $V$ comprise 0, $R$ and the branch points of $T_{phys}$. The link set $L$
contains the link $(j, k)$ if one or more of the probe paths in $T_{phys}$ pass
through $j$ and then $k$ without encountering another element of $V$ in between.
We will sometimes refer to link $(j, k) \in L$ simply as link $k$. For $k \neq 0$,
$f(k)$ denotes the parent of $k$. We write $j \succ k$ if $j$ is an ancestor of
$k$ in $T$. $i \vee j$ denotes the minimal common ancestor of $i$ and $j$ in the
$\succ$-ordering.



Packet Pair and Delay Model. Let $(i,j)$ denote a packet pair dispatched to
destination nodes $i, j$ in that order. The paths traverse a common set of links
down to node $i \vee j$. Let $p(i,j)$ denote the set of nodes traversed by at
least one member of the packet pair. For $k \in p(i,j)$ let $G(k) \subset \{1, 2\}$,
where 1 and 2 denote the two packets sent in order to $i$ and $j$, denote the set
of packets that transit $k$. We describe the progress of the packet pair in $T$ by
the variable $X_k(l)$, $l \in G(k)$, which represents the accrued queueing delay of
packet $l$ along the route to $k$. We assume that we only observe the end-to-end
delay $X_{ij} = (X_i(1), X_j(2))$ at receivers $i$ and $j$.

We specify a delay model for the packet pair. We associate with each node $k$
a pair of random variables $D_k$ and $D'_k$ that take values in the extended
positive real line $\mathbb{R}_+ \cup \{\infty\}$. By convention $D_0 = D'_0 = 0$.
$D_k$ ($D'_k$) is the delay that would be encountered by the first (second) packet
attempting to traverse the link $(f(k), k) \in L$. A delay equal to $\infty$
indicates that the packet is lost on the link. We assume that delays are
independent between different pairs, and for packets of the same pair on different
links. The delay experienced by packet 1 on the path from root 0 to node $k$ is
$X_k(1) = \sum_{l \succeq k} D_l$. The delay experienced by packet 2 is
$X_k(2) = \sum_{l \succeq (i \vee j) \vee k} D'_l + \sum_{(i \vee j) \vee k \succ l \succeq k} D_l$.
Note that $X_k(\cdot) = \infty$ iff any delay along the path to $k$ is infinite,
i.e. if the packet is lost on some link between nodes 0 and $k$.

For any $k \in V$, $E_k = D'_k - D_k$ is the difference between the delays
experienced by the back-to-back packets of a packet pair traversing $k$. Ideally,
$E_k = 0$, and the packet pair behaves like a notional multicast packet sent to the
two receivers. In practice, we expect the two delays to be different. This is
because congestion events at intervening nodes may not affect packets uniformly if
they are not back-to-back. This occurs, for example, when the packets are spaced
apart as a result of traversing a bottleneck (low available bandwidth) link and
background traffic is interleaved between them. Observe that $E_k \neq 0$ even in
the case of perfectly back-to-back packets, since packet 2 suffers an additional
delay due to the time required to transmit packet 1.



Measurement. A measurement experiment consists of sending, for each pair of
distinct receivers $i, j \in R$, $n$ packet pairs $(i,j)$. As a result of the
experiment we collect a set of measurements $(X_{ij}^{(m)})_{m=1,\dots,n}$, where
$X_{ij}^{(m)} = (X_i(1)^{(m)}, X_j(2)^{(m)})$ is the end-to-end delay of the
$m$-th packet pair $(i,j)$. Let $X = (X_{ij}^{(m)})_{i \neq j \in R,\, m=1,\dots,n}$
denote the complete set of measurements.



3 Non-parametric Estimation of Delay Distribution

In this section we describe techniques for the estimation of the probability
distribution of the per-link variable delay $D_k$. We quantize the delay to a
finite set of values $Q$. We assume that, once quantized, $D_k = D'_k$. In other
words, we assume $E_k$ small enough that it can be ignored in the discrete model.
We consider two cases. First, in Section 3.1 we consider the most usual form of
discretization, where we discretize delay to a set
$Q = \{0, q, 2q, \dots, Bq, \infty\}$, where $q$ is a suitable fixed bin size.
Then, in Section 3.2 we consider a different approach whereby delay is discretized
to a more general set $Q$.

For the analysis, we thus model the link delay by a nonparametric discrete
distribution that we can regard as a discretized version of the (possibly
continuous) actual delay distribution. We denote the distribution of $D_k$ by
$\alpha_k = (\alpha_k(d))_{d \in Q}$, where $\alpha_k(d) = P[D_k = d]$, $d \in Q$.
We denote by $\alpha = (\alpha_k)_{k \in V}$ the set of link distributions.



3.1 Delay Analysis with a Fixed Bin Size Discrete Model 

Here, we consider the usual discrete model wherein $D_k$ takes a value in
$Q = \{0, q, 2q, \dots, Bq, \infty\}$, where $q$ is a suitable fixed bin size. The
point $\infty$ is interpreted as “packet lost” or “encountered delay greater than
$Bq$”. We define the bin associated to $iq \in Q$ to be the interval
$[iq - q/2, iq + q/2)$, $i = 1, \dots, B$, and $[Bq + q/2, \infty)$ the one
associated to the value $\infty$. Because delay is non-negative, we associate with
0 the bin $[0, q/2)$. We denote this model as the $(q, B)$ model.
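
The bin definition above can be sketched in a few lines of Python. This is only an illustration: the function name `quantize_fixed` is hypothetical, and the sketch assumes that every delay at or beyond $Bq + q/2$ (including a lost packet) maps to the $\infty$ value.

```python
import math

def quantize_fixed(delay, q, B):
    """Map a non-negative delay to the (q, B) value set {0, q, ..., B*q, inf}."""
    if delay < 0:
        raise ValueError("delay must be non-negative")
    if math.isinf(delay):
        return math.inf  # a lost packet is represented by an infinite delay
    # Bin i covers [i*q - q/2, i*q + q/2); bin 0 is truncated to [0, q/2).
    i = math.floor(delay / q + 0.5)
    # Assumption: anything past the last finite bin falls in the "inf" bin.
    return i * q if i <= B else math.inf
```

With q = 1 and B = 3, for instance, a delay of 2.2 is mapped to 2.0, while a delay of 3.5 already falls in the $\infty$ bin.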

Our goal is to estimate $\alpha$ using maximum likelihood based on the overall
observed data $X$ (also discretized to the set $Q$). Denote by
$\Omega = Q \times Q$ the set of possible outcomes for the packet pair delays. For
each outcome $x_{ij} \in \Omega$, denote by $n(x_{ij})$ the number of pairs
$(i,j)$, $m = 1, \dots, n$, whose measured end-to-end delay $X_{ij}^{(m)}$ equals
$x_{ij}$. Let $p_\alpha(x_{ij}) = P_\alpha[X_{ij} = x_{ij}]$ denote the probability
of the outcome $x_{ij}$. $p_\alpha(x_{ij})$ can be expressed in terms of
convolutions of the distributions $\alpha_k$, $k \in p(i,j)$.

The log-likelihood of the measurement $X$ is

$$\mathcal{L}(X; \alpha) = \log P_\alpha[X] = \sum_{i \neq j \in R} \sum_{x_{ij} \in \Omega} n(x_{ij}) \log p_\alpha(x_{ij}) \qquad (1)$$

We estimate $\alpha$ by the maximizer of the likelihood (1), namely
$\hat\alpha = \arg\max_\alpha \mathcal{L}(X; \alpha)$. Unfortunately, given the
form of (1), we have been unable to obtain a direct expression for $\hat\alpha$.
Instead, we follow the approach in [?] and employ the Expectation-Maximization
(EM) algorithm to obtain an iterative approximation $\hat\alpha^{(\ell)}$,
$\ell = 0, 1, \dots$, to a (local) maximizer of the likelihood (1). The basic idea
behind the EM algorithm is that, rather than performing a complicated
maximization, we “augment” the observed data with unobserved or latent data so
that the resulting likelihood has a simpler form. Following [?], we augment the
observations $X$ with the unobserved actual delays experienced by the packet pairs
along each link, namely $D = (D_k^{ij})_{k \in p(i,j),\, i \neq j \in R}$, where
$D_k^{ij} = (D_k^{ij,(m)})_{m=1,\dots,n}$ are the delays experienced by the $n$
packet pairs $(i,j)$ along link $k$. The pair $(X, D)$ represents the complete data
for our inference problem. The log-likelihood of the complete data $(X, D)$ is



$$\mathcal{L}(X, D; \alpha) = \log P_\alpha[X, D] = \log P_\alpha[X \mid D] + \log P_\alpha[D]. \qquad (2)$$

The first term is 0 (since $D$ uniquely determines $X$, we have
$P_\alpha[X \mid D] = 1$). Expansion of the second term yields

$$\log P_\alpha[D] = \sum_{i \neq j \in R} \sum_{k \in p(i,j)} \log P_\alpha[D_k^{ij}] = \sum_{k \in V} \sum_{d \in Q} n_k(d) \log \alpha_k(d), \qquad (3)$$

where $n_k(d)$ is the total number of packet pairs that experienced a delay equal
to $d$ along link $k$. Were $D$ observable, the counts $n_k(d)$ would be known,
and maximization of (3) would directly yield the MLE of $\alpha_k(d)$,

$$\hat\alpha_k(d) = \frac{n_k(d)}{\sum_{d' \in Q} n_k(d')}. \qquad (4)$$



Since $D$ and $n_k(d)$ are not known, the EM algorithm uses the complete data
log-likelihood $\mathcal{L}(X, D; \alpha)$ to iteratively find $\hat\alpha$ as
follows:

1. Initialization. Select the initial link delay distribution
$\hat\alpha^{(0)}$. As shown in the Appendix, we select as $\hat\alpha^{(0)}$ an
estimate of $\alpha$ that we compute by adapting the approach in [?].

2. Expectation. Given the current estimate $\hat\alpha^{(\ell)}$, compute the
conditional expectation of the log-likelihood given the observed data $X$ under
the probability law induced by $\hat\alpha^{(\ell)}$,

$$Q(\alpha'; \hat\alpha^{(\ell)}) = E_{\hat\alpha^{(\ell)}}[\mathcal{L}(X, D; \alpha') \mid X] = \sum_{k \in V} \sum_{d \in Q} \hat n_k(d) \log \alpha'_k(d),$$

where $\hat n_k(d) = E_{\hat\alpha^{(\ell)}}[n_k(d) \mid X]$.
$Q(\alpha'; \hat\alpha^{(\ell)})$ has the same expression as
$\mathcal{L}(X, D; \alpha')$ but with the actual unobserved counts $n_k(d)$
replaced by their conditional expectations $\hat n_k(d)$. To compute
$\hat n_k(d)$, observe that we can write the counts as
$n_k(d) = \sum_{i \neq j \in R:\, k \in p(i,j)} \sum_{m=1}^{n} 1\{D_k^{ij,(m)} = d\}$.
Then

$$\hat n_k(d) = \sum_{i \neq j \in R:\, k \in p(i,j)} \sum_{m=1}^{n} P_{\hat\alpha^{(\ell)}}[D_k^{ij,(m)} = d \mid X_{ij}^{(m)}] \qquad (5)$$

$$\phantom{\hat n_k(d)} = \sum_{i \neq j \in R:\, k \in p(i,j)} \sum_{x_{ij} \in \Omega} n(x_{ij})\, P_{\hat\alpha^{(\ell)}}[D_k = d \mid X_{ij} = x_{ij}] \qquad (6)$$

3. Maximization. Find the maximizer of the conditional expectation,
$\hat\alpha^{(\ell+1)} = \arg\max_{\alpha'} Q(\alpha'; \hat\alpha^{(\ell)})$. The
maximizer is given by (4) with the conditional expectation $\hat n_k(d)$ in place
of $n_k(d)$.

4. Iteration. Iterate steps 2 and 3 until some termination criterion is satisfied.
Set $\hat\alpha = \hat\alpha^{(\ell)}$, where $\ell$ is the terminal number of
iterations.

Convergence. Because the complete data likelihood can be shown to derive from
a standard exponential family, the EM iterates $\hat\alpha^{(\ell)}$ converge to a
stationary point $\alpha^*$ of the likelihood, i.e.,
$\frac{\partial \mathcal{L}}{\partial \alpha_k(d)}(\alpha^*) = 0$ (see e.g. [?]).
This implies that when there are multiple stationary points, e.g. local maxima,
the EM iterates may not converge to the global maximizer. Unfortunately, we were
not able to establish whether there is a unique stationary point, or conditions
under which uniqueness holds. Therefore, in general the estimates
$\hat\alpha^{(\ell)}$ converge to a local (but not necessarily global) maximizer.
Since the point of convergence depends on the initial estimate, we must carefully
choose the initial estimate $\hat\alpha^{(0)}$. Here we select as initial
distribution the estimate of $\alpha$ obtained by using the approach in [?]
(see the Appendix). We expect that, for large enough $n$, $\hat\alpha^{(0)}$ (which
converges to $\alpha$) is close enough to the actual likelihood maximizer to
ensure, in most cases, the desired convergence.
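
To make the Expectation and Maximization steps concrete, the following Python sketch runs the iteration on the simplest topology, a tree with one shared link and two leaf links. It is a minimal illustration under several assumptions that are not part of the method described above: the function name `em_two_leaf` and the `counts` dictionary are hypothetical, there is no $\infty$ (loss) value, the E-step enumerates the shared-link delay by brute force instead of using a probability propagation algorithm, and the iteration starts from a uniform distribution rather than from the initial estimate discussed in the text.

```python
import numpy as np

def em_two_leaf(counts, B, max_iter=500, tol=1e-6):
    """EM for a two-leaf tree: shared link 0, leaf links 1 and 2.

    counts maps an observed outcome (x1, x2), in bin units, to the number of
    packet pairs that produced it; link delays take values in {0, ..., B}.
    """
    alpha = np.full((3, B + 1), 1.0 / (B + 1))  # uniform start (an assumption)
    loglik_trace = []
    for _ in range(max_iter):
        expected = np.zeros_like(alpha)
        loglik = 0.0
        for (x1, x2), n in counts.items():
            # Feasible shared-link delays: both residuals must lie in [0, B].
            lo = max(0, x1 - B, x2 - B)
            hi = min(B, x1, x2)
            d0 = np.arange(lo, hi + 1)
            joint = alpha[0, d0] * alpha[1, x1 - d0] * alpha[2, x2 - d0]
            p = joint.sum()
            if p <= 0.0:
                raise ValueError(f"outcome {(x1, x2)} is impossible under the model")
            loglik += n * np.log(p)
            post = n * joint / p  # E-step: expected counts from this outcome
            expected[0, d0] += post
            expected[1, x1 - d0] += post
            expected[2, x2 - d0] += post
        loglik_trace.append(loglik)
        # M-step: normalize the expected counts, as in (4).
        new_alpha = expected / expected.sum(axis=1, keepdims=True)
        converged = np.max(np.abs(new_alpha - alpha)) < tol
        alpha = new_alpha
        if converged:
            break
    return alpha, loglik_trace
```

The returned log-likelihood trace is non-decreasing, which is a convenient sanity check on any implementation; as noted above, it does not guarantee that the global maximizer has been reached.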

Complexity. The complexity of the algorithm is dominated by the computation
of the conditional expectations $\hat n_k(d)$, which can be accomplished in time
$O(npB^2)$, where $p$ is the average number of links between the source and a leaf
node, using the upward-downward probability propagation algorithm [2].

Choice of bin size. Since packet delay is essentially continuous in nature, the
use of a discrete model introduces a quantization error, which is a function of
the bin size $q$. The choice of $q$ is thus primarily dictated by the trade-off
between accuracy and computational complexity: a smaller $q$ provides better
accuracy but at rapidly increasing computational cost (observe that, since the
product $qB$ is constant, the complexity is basically $O(np/q^2)$); on the other
hand, a larger bin size reduces the computational complexity but may not be
adequate to accurately capture very small delays.

We must also consider that, in the context of unicast measurements, the
bin size must be large enough that, once discretized to $Q$, we can use
the approximation $D'_k \approx D_k$. Our network experiments in Section 5 suggest
that $q$ should not be smaller than 1 msec to satisfy this condition.

3.2 Delay Analysis with Variable Bin Size Discrete Model 

Here we consider a more general form of discrete model in which $D_k$ takes
values in a more general finite set $Q$. This is motivated by the observation that
the use of a fixed bin size may be too restrictive in the analysis of large
networks where delay characteristics vary significantly from node to node: a value
of $q$ chosen to adequately capture the delay behavior of very fast links would
result in too many parameters if slower or congested links are also present.
Ideally, to overcome the limitations of the accuracy/complexity trade-off of the
fixed bin size models, it is preferable to discretize delay to a suitable set $Q$
which guarantees the desired resolution in the delay range of interest while
keeping the overall number of bins sufficiently small. For example, smaller bins
could be used only where probability mass is concentrated, to ensure adequate
resolution, while larger bins could be used elsewhere. Intuitively, this would
allow us to reduce the number of parameters (bins), and hence the complexity,
significantly without losing accuracy.

A complication with this approach is that a discrete model with a general
set of values $Q$ does not lend itself to analysis. The problem is that a general
discrete set $Q$ is not closed under the sum operation. Therefore, we cannot
express the observable delay (discretized to $Q$) in terms of a sum of link delays
(also discretized to $Q$).

To overcome these difficulties, we now describe a simple approach to variable
bin size modeling which, while restricting the possible choices of $Q$ to a
specific format, lends itself to analysis. The key idea is to consider variable
bin size models whose analysis can be reduced to that of a set of fixed bin size
models. We proceed as follows: (1) we define a variable bin size model as the
composition (in the sense described below) of fixed bin size models; and (2) we
choose the constituent fixed bin size models so that the estimates of the
distribution for these models can be composed to form the estimate of the
distribution of the variable bin size model itself. By appropriate choice of the
fixed bin size models, the resulting variable bin size model has a better
accuracy/complexity trade-off. We detail the approach below.

Fig. 2. Variable bin size model as composition of fixed bin size models.

We define the variable bin size discrete model as the composition of $M$ uniform
bin size discrete models $((q_l, B_l))_{l=1,\dots,M}$ with increasing bin size,
$0 < q_1 < \dots < q_M$, and such that $B_1 q_1 < \dots < B_M q_M$ (see Figure 2).
We assume that for $l = 2, \dots, M$, each bin of level $l$ either corresponds to
an integer number of level $l-1$ bins (i.e., the boundaries of the bin of level
$l$ correspond to boundaries of a group of adjacent bins of level $l-1$) or is
contained in the $\infty$ bin of level $l-1$. We let $g_l(j)$ denote the set of
level $l-1$ bins which corresponds to the $j$-th level $l$ bin,
$j = 0, \dots, B'_l < B_l$, where $B'_l$ is the first level $l$ bin contained in
the last level $l-1$ bin (the one corresponding to $\infty$).

In the variable bin size model, $D_k$ takes values in
$Q = \{0, q_1, \dots, B_1 q_1, B'_2 q_2, \dots, B_M q_M, \infty\}$. We define the
bin associated to $i q_l \in Q$ as the interval $[i q_l - q_l/2, i q_l + q_l/2)$,
and $[B_M q_M + q_M/2, \infty)$ the one associated to $\infty$. With this
definition, we create a correspondence between the bins of the variable bin size
model and bins of the fixed bin size models (the shaded bins in Figure 2). This
allows us to express the distribution $\alpha$ in the variable bin size model in
terms of the delay distributions in the $M$ uniform bin size models. For
$k \in V$, denote by $\alpha_k(d; q_l) = P[D_k = d]$,
$d \in Q_l = \{0, \dots, B_l q_l, \infty\}$, the distribution for the model with
fixed bin size $q_l$. The distribution of $D_k$ in the variable bin size model is
then $\alpha_k = (\alpha_k(d))_{d \in Q}$, where
$\alpha_k(d) = \alpha_k(i q_l; q_l)$, $d = i q_l \in Q$, and
$\alpha_k(\infty) = \alpha_k(\infty; q_M)$, $k \in V$. We will take advantage of
this correspondence for the estimation.

With the above definition, we are limited to variable bin size models where 
the bin size progressively increases. However, we do not believe this choice to be 
restrictive. Indeed, we expect that in most cases it is desirable to have smaller 
bins in correspondence with small delay values and larger bins otherwise; while 
the small bins guarantee enough resolution for very fast or uncongested links, 
the larger bins prevent the explosion of the number of parameters due to the 
large delays experienced by the slower and congested links. 

Example. We consider the ternary variable bin size model defined, for a given
base bin size $q$ and number of levels $M$, as the composition of fixed bin size
models with bin sizes $q_l = 3^{l-1} q$, $l = 1, \dots, M$. Delay is thus
discretized to the set $\{0, q, 3q, 9q, \dots, 3^{M-1} q, \infty\}$ (see Figure 3).
This can be considered as an extreme case where each level has only three bins,
$0$, $q_l$ and $\infty$, and the bin size grows exponentially with the level.
Observe that this model covers the delay range from 0 up to a maximum value
$d_{max}$ with only $O(\log_3 (d_{max}/q))$ bins.

Fig. 3. Example: The Ternary Variable Bin Size Model ($M = 4$).

We estimate the distribution $\alpha$ of the variable bin size model indirectly by
taking advantage of the relationship between the bins of the variable bin size
model and those of the component fixed bin size models. Basically, we estimate
the probabilities of the former by the corresponding estimates of the latter. More
precisely, estimation of $\alpha$ proceeds by computing recursively the MLEs
of the $M$ discrete models $(q_l, B_l)$, starting with $l = 1$, as follows:

1. Discretize the delays to the set $Q_l$.

2. Estimate the probabilities $\alpha_k(d; q_l)$, $d \in Q_l$, $k \in V$. For
$l = 1$, we use the EM algorithm directly. For $l > 1$, to have consistency between
the estimates of the different models, we compute the estimates of the
probabilities of level $l$ bins corresponding to a group of level $l-1$ bins
directly as the sum of the probabilities of those bins. In other words, we let
$\hat\alpha_k(d; q_l) = \sum_{j \in g_l(d/q_l)} \hat\alpha_k(j q_{l-1}; q_{l-1})$
for $d < q_l B'_l$. We then use the EM algorithm to estimate the remaining
probabilities $\alpha_k(d; q_l)$ for $d \geq q_l B'_l$, treating the probabilities
$\alpha_k(d; q_l)$, $d < q_l B'_l$, as known parameters (set equal to the estimates
above). This is equivalent to the EM algorithm shown in Section 3.1 where we
replace (4) with

$$\hat\alpha_k(d; q_l) = \Big(1 - \sum_{d' < B'_l q_l} \hat\alpha_k(d'; q_l)\Big) \frac{\hat n_k(d)}{\sum_{d' \geq B'_l q_l} \hat n_k(d')} \qquad (7)$$

3. Iterate steps 1 and 2 for $l = 1, \dots, M$.

4. Compose the estimates of the $M$ models to estimate $\alpha$, i.e., set
$\hat\alpha_k(d) = \hat\alpha_k(i q_l; q_l)$, $d = i q_l \in Q$, $k \in V$.
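
The composition in the last step can be illustrated for the ternary model, where every level has only the three bins $0$, $q_l$ and $\infty$. The Python sketch below is a hedged illustration for a single link: the function name `compose_ternary` and the input layout are hypothetical, and the consistency check assumes that the zero bin of each level is exactly the union of the two finite bins of the previous level.

```python
def compose_ternary(levels, tol=1e-9):
    """Compose per-level ternary estimates into one variable-bin distribution.

    levels[l] = (p_zero, p_one, p_inf): estimated probabilities of the bins
    0, q_l and infinity at level l + 1, where q_l = 3**l * q.
    Returns the probabilities of the values 0, q, 3q, ..., 3**(M-1) q, inf.
    """
    for prev, cur in zip(levels, levels[1:]):
        # The zero bin of a level aggregates both finite bins of the level below.
        if abs(cur[0] - (prev[0] + prev[1])) > tol:
            raise ValueError("inconsistent estimates between adjacent levels")
    return [levels[0][0]] + [lv[1] for lv in levels] + [levels[-1][2]]
```

For three levels with estimates (0.5, 0.2, 0.3), (0.7, 0.2, 0.1) and (0.9, 0.08, 0.02), the composed distribution over $\{0, q, 3q, 9q, \infty\}$ is (0.5, 0.2, 0.2, 0.08, 0.02).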



Complexity. The computational cost equals the sum of the costs of computing
the MLEs of each model. Assuming for simplicity that the number of
iterations required by the EM algorithm does not vary, the complexity is then
$O(np \sum_{l=1}^{M} B_l^2)$.

Choice of the Variable Bin Size Model. The use of the variable bin size model
provides great flexibility in terms of both accuracy and computational cost. We
consider two examples below. To ensure high accuracy, a simple solution lies
in using a variable bin size model with only two levels, i.e., $M = 2$: the first
level has a small bin size, chosen according to the desired level of accuracy, and
enough bins to include most of the probability mass, e.g., $B_1$ large enough that
$P[D_k < B_1 q_1] > 0.999$; the second level has a larger bin size and covers the
rest of the delay interval. We expect that capturing the tail of the distribution
with a larger bin size can provide a significant reduction in the computational
cost without accuracy degradation. At the other extreme, we might consider the
solution which has the smallest complexity. Since the complexity of each level is
proportional to $B_l^2$, we simply have to minimize the number of bins per level
and use as many levels as necessary. We thus obtain the ternary variable bin size
model. In between these extreme cases, it is possible to consider several models
which provide the desired accuracy/complexity trade-off. In general we expect the
model to be determined either a priori or based on the measurements themselves.
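
As an illustration of the ternary model, the following Python sketch lists its value set and maps a delay to the corresponding value. The function names are hypothetical, and the bin boundaries assume the centered-bin convention used for the fixed bin size model, so that the value $3^{l-1} q$ covers $[3^{l-1} q / 2, 3^{l} q / 2)$.

```python
import math

def ternary_levels(q, M):
    """Value set {0, q, 3q, ..., 3**(M-1) q, inf} of the ternary model."""
    return [0.0] + [q * 3 ** l for l in range(M)] + [math.inf]

def ternary_quantize(delay, q, M):
    """Map a non-negative delay to a value of the ternary model."""
    if delay < 0:
        raise ValueError("delay must be non-negative")
    if delay < q / 2:
        return 0.0
    for l in range(M):
        # Value 3**l * q covers [3**l * q / 2, 3**(l+1) * q / 2).
        if delay < q * 3 ** (l + 1) / 2:
            return q * 3 ** l
    return math.inf
```

With q = 1 msec and M = 4, delays of 1.4, 4.4 and 13.4 msec map to 1, 3 and 9 msec respectively, so the resolution is finest near zero and degrades geometrically with the delay.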



3.3 Comparison of the Variable and Fixed Bin Size Model 

We illustrate the potential benefit of the variable bin size model using
model-based simulations in which link delays are independent, exponentially
distributed random variables. We assume no packet loss. We conducted 1000
independent experiments over the 2-leaf tree in Figure [?]. In each experiment, we
sent 1000 packet pairs down the tree. We assumed that the back-to-back packets
have the same delay along the common link. Average link delays were chosen
independently with a uniform distribution in the interval [0.1, 10] msec.

For the analysis, we consider three different discrete models: the first two
are the fixed bin size models (1 msec, 100) and (10 msec, 10); the third
model is the ternary model $(3^{l-1}\ \mathrm{msec}, 2)_{l=1,\dots,5}$. The number
of bins in each model was chosen so that the largest finite delay in each model
was about 100 msec. We used the EM algorithm for the estimation. Initialization
was performed as described in the Appendix. The termination criterion for the EM
algorithm was that successive iterates of any probability should have an absolute
distance less of 10“^. 

Complexity. The computational costs differ substantially. To compare the costs,
observe that each iteration has a complexity proportional to the square of the
number of bins, which for the three models is 10,000, 100 and 9. The average
number of iterations for the three models was, respectively, 22, 13 and 31
(the last number is the sum over the 5 fixed bin size models). Thus, the fixed
bin size model with bin size 1 msec requires about two orders of magnitude
more operations than the variable bin size model which, despite the need to
execute the EM algorithm multiple times, has the smallest
computational complexity.
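
The comparison can be reproduced with a few lines of arithmetic. The Python sketch below multiplies the per-iteration cost (square of the number of bins) by the average number of iterations quoted above; it assumes, as a simplification, that all constant factors are identical across models, so the resulting numbers are only relative costs.

```python
# (squared number of bins per iteration, average number of EM iterations),
# taken from the figures quoted in the text.
models = {
    "fixed, q = 1 msec": (10_000, 22),
    "fixed, q = 10 msec": (100, 13),
    "ternary (5 levels)": (9, 31),
}

# Relative cost = per-iteration cost times number of iterations.
relative_cost = {name: b2 * iters for name, (b2, iters) in models.items()}

for name, cost in relative_cost.items():
    print(f"{name}: {cost}")
```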

Table 1. Median of the Absolute Relative Error of the Average Delay Estimates.

                                        fixed bin size          variable bin size
                                    q = 1 msec   q = 10 msec
all links                              2.6%          8.1%             17.6%
links with average delay < 1 msec      9.7%         64.4%             14.6%


Accuracy. We now compare the accuracy of the different approaches. To
quantify the accuracy, in Table 1 we list the median of the absolute relative error
of the average delay estimates. As expected, the best performance is achieved by
the fixed bin size model with q = 1 msec; use of a larger bin, while greatly
reducing the complexity, resulted in very poor accuracy for the smaller delays (if
we consider only links with average delay smaller than 1 msec, the typical error
was 64.4%). By contrast, the variable bin size model achieves good accuracy across
the entire delay range, while at the same time enjoying a low computational
cost.

4 Non-parametric Estimation of Link Delay Variance 

In this section we present a class of non-parametric estimators of the link delay
variance. We assume initially that all delays are finite: $P[D_k = \infty] = 0$.
We will later relax this assumption.

Fig. 4. Logical multicast Tree (left) and the two subtrees traversed by the pairs
$(i,j)$ (center) and $(i',j')$ (right).



For a node $k \in V$, consider the packet pairs $(i,j)$ and $(i',j')$, dispatched
to the nodes $i$ and $j$ and $i'$ and $j'$, respectively, such that
$i \vee j = k$ and $i' \vee j' = f(k)$; see Figure 4. From the assumption that
delays along different links are independent and the bilinearity of the
covariance, for the packet pair $(i,j)$ it follows that

$$\mathrm{Cov}[X_i(1), X_j(2)] = \mathrm{Cov}[X_k(1) + (X_i(1) - X_k(1)),\ X_k(2) + (X_j(2) - X_k(2))] \qquad (8)$$

$$= \mathrm{Cov}[X_k(1), X_k(2)] \qquad (9)$$

$$= \mathrm{Var}[X_k(1)] + \mathrm{Cov}[X_k(1), X_k(2) - X_k(1)] \qquad (10)$$

$$= \mathrm{Var}[X_k(1)] + \sum_{l \succeq k} \mathrm{Cov}[D_l, E_l]. \qquad (11)$$

Similarly, for the packet pair $(i',j')$ we have that
$\mathrm{Cov}[X_{i'}(1), X_{j'}(2)] = \mathrm{Var}[X_{f(k)}(1)] + \sum_{l \succeq f(k)} \mathrm{Cov}[D_l, E_l]$.
Observe that $X_k(1) = X_{f(k)}(1) + D_k$, and that
$X_{f(k)}(1) = \sum_{l \succeq f(k)} D_l$ and $D_k$ are independent. Therefore,
$\mathrm{Var}[D_k] = \mathrm{Var}[X_k(1)] - \mathrm{Var}[X_{f(k)}(1)]$, which we
can rewrite as

$$\mathrm{Var}[D_k] = \mathrm{Cov}[X_i(1), X_j(2)] - \mathrm{Cov}[X_{i'}(1), X_{j'}(2)] - \mathrm{Cov}[D_k, E_k] \qquad (12)$$

$$\approx \mathrm{Cov}[X_i(1), X_j(2)] - \mathrm{Cov}[X_{i'}(1), X_{j'}(2)] \qquad (13)$$

under the assumption that $|\mathrm{Cov}[D_k, E_k]| \ll \mathrm{Var}[D_k]$.
(Observe that $\mathrm{Cov}[D_k, E_k] = 0$, in particular, if $E_k$ and $D_k$ are
independent, or if $E_k$ is constant.) (13) expresses the variance of the packet
delay along link $k$ in terms of the covariance of delays measured at receivers.
We can form an estimator of $\mathrm{Var}[D_k]$ (which is unbiased if
$\mathrm{Cov}[D_k, E_k] = 0$) from the unbiased estimators of the end-to-end
covariances. More precisely, abbreviate $\mathrm{Cov}[X_i(1), X_j(2)] = s_{ij}$
and $\mathrm{Var}[D_k] = v_k$. We can then estimate $v_k$ by the difference
$\hat s_{ij} - \hat s_{i'j'}$ of the unbiased estimators of $s_{ij}$ and
$s_{i'j'}$, namely



$$\hat s_{ij} = \frac{1}{n-1} \left( \sum_{m=1}^{n} X_i(1)^{(m)} X_j(2)^{(m)} - \frac{1}{n} \sum_{m=1}^{n} X_i(1)^{(m)} \sum_{m=1}^{n} X_j(2)^{(m)} \right) \qquad (14)$$

and similarly for $\hat s_{i'j'}$.

More generally, let $Q(k) = \{\{i,j\} \subset R \mid i \vee j = k\}$ be the set of
distinct pairs of receivers whose $\succ$-least common ancestor is $k \in V$.
Measurements of the packet pairs $(i,j)$, $\{i,j\} \in Q(k)$, and $(i',j')$,
$\{i',j'\} \in Q(f(k))$, yield estimates of $v_k$, namely
$\hat s_{ij} - \hat s_{i'j'}$, as does any convex combination
$\sum_{\{i,j\} \in Q(k),\, \{i',j'\} \in Q(f(k))} \mu_{iji'j'} (\hat s_{ij} - \hat s_{i'j'})$
(where the $\mu_{iji'j'}$ are non-negative and sum to 1), which we can
rewrite as

$$V_k(\mu, \hat s) := \sum_{\{i,j\} \in Q(k)} \mu_{ij}(k)\, \hat s_{ij} - \sum_{\{i',j'\} \in Q(f(k))} \mu_{i'j'}(f(k))\, \hat s_{i'j'} \qquad (15)$$

where $\hat s = \{\hat s_{ij} : \{i,j\} \in Q(k), k \in V\}$,
$\mu(k) = (\mu_{ij}(k))_{\{i,j\} \in Q(k)}$ with $\mu_{ij}(k) \geq 0$ and
$\sum_{\{i,j\} \in Q(k)} \mu_{ij}(k) = 1$, and similarly for $\mu_{i'j'}(f(k))$.
Finally, denote $\mu = (\mu(k), \mu(f(k)))$. An example is the uniform estimator,
where all $\mu_{ij}(k)$ and $\mu_{i'j'}(f(k))$ are constant. The uniform estimator
has the disadvantage that a high variance of any summand may result in a high
variance in the overall estimate. By proper choice of the weights we can determine
the estimator $V_k(\mu, \hat s)$ of minimum variance.
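
The basic single-pair estimator, i.e. the difference of two unbiased sample covariances, can be sketched in Python as follows. The function names `sample_cov` and `link_variance` are hypothetical, and the sketch assumes that all delays are finite and that each argument holds the $n$ end-to-end delays of one packet of a pair.

```python
import numpy as np

def sample_cov(x1, x2):
    """Unbiased sample covariance of paired end-to-end delays, as in (14)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n = x1.size
    return (np.sum(x1 * x2) - np.sum(x1) * np.sum(x2) / n) / (n - 1)

def link_variance(pair_k, pair_parent):
    """Estimate Var[D_k] as s_ij - s_i'j', following (13).

    pair_k      = (X_i(1), X_j(2)) samples for a pair with i v j = k
    pair_parent = (X_i'(1), X_j'(2)) samples for a pair with i' v j' = f(k)
    """
    return sample_cov(*pair_k) - sample_cov(*pair_parent)
```

The first function agrees with the off-diagonal entry of `numpy.cov` for the same inputs; it is written out explicitly only to mirror the formula.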

The next theorem characterizes the asymptotic behavior of $V_k(\mu, \hat s)$ and
gives a form for the estimator of minimum variance. The proofs follow the same
lines as those for the multicast case in [?] and are omitted. Define
$Z_i(l) = X_i(l) - E[X_i(l)]$, $i \in R$, $l = 1, 2$, and let
$w_{ij} = \mathrm{Var}[Z_i(1) Z_j(2)]$, $i, j \in R$, and
$m_k = \mathrm{Cov}[D_k, E_k]$.

Theorem 1. For each $k \in V$:

(i) the random variables $\sqrt{n}\,(\hat s_{ij} - v_k + m_k)$,
$\{i,j\} \in Q(k)$, are independent and converge in distribution to a Gaussian
random variable with mean 0 and variance $w_{ij}$;

(ii) for any choice of $\mu$, $\sqrt{n}\,(V_k(\mu, \hat s) - v_k + m_k)$ converges
in distribution to a Gaussian random variable of mean zero and variance
$\sum_{\{i,j\} \in Q(k)} \mu_{ij}^2(k)\, w_{ij} + \sum_{\{i',j'\} \in Q(f(k))} \mu_{i'j'}^2(f(k))\, w_{i'j'}$;

(iii) the minimal asymptotic variance of the estimator $V_k(\mu, \hat s)$ is
achieved when
$\mu_{ij}(h) = \mu^*_{ij}(h) := w_{ij}^{-1} / \sum_{\{i',j'\} \in Q(h)} w_{i'j'}^{-1}$,
$h \in \{k, f(k)\}$. The corresponding asymptotic variance of the estimator is
$\frac{1}{\sum_{\{i,j\} \in Q(k)} w_{ij}^{-1}} + \frac{1}{\sum_{\{i',j'\} \in Q(f(k))} w_{i'j'}^{-1}}$.

Theorem 1 shows that $V_k(\mu, \hat s)$ is asymptotically normal. We define the
estimator bias as $b_k = |E[V_k(\mu, \hat s) - v_k]|$, $k \in V$. For large $n$, we
can use the approximation $b_k \approx |\mathrm{Cov}[D_k, E_k]|$. Thus, under the
assumption that $|\mathrm{Cov}[D_k, E_k]| \ll \mathrm{Var}[D_k]$, we have
$E[V_k(\mu, \hat s)] \approx v_k$.

Operationally, the weights $\mu^*$ need to be calculated from an estimate
$\hat w_{ij}$ of the variances $w_{ij}$. These can be computed as shown in [?]. The
resulting estimator $V_k(\hat\mu^*, \hat s)$, where $\hat\mu^*$ is obtained by
using $\hat w_{ij}$ in place of $w_{ij}$, has the same asymptotic behavior as
$V_k(\mu^*, \hat s)$.
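
A small numerical sketch may help to see why the choice of weights matters. It assumes that part (iii) of the theorem prescribes the usual inverse-variance weighting, i.e. weights proportional to $1/w_{ij}$ within each group of pairs; the function names are hypothetical, and the variances in `w` are treated as known for the purpose of the illustration.

```python
def min_variance_weights(w):
    """Weights proportional to 1 / w_i, normalized so that they sum to 1."""
    inv = [1.0 / wi for wi in w]
    total = sum(inv)
    return [x / total for x in inv]

def combined_variance(mu, w):
    """Variance of a weighted sum of independent terms with variances w."""
    return sum(m * m * wi for m, wi in zip(mu, w))
```

For two pairs with variances 1 and 4, the inverse-variance weights are 0.8 and 0.2 and give a combined variance of 0.8, against 1.25 for uniform weights of 0.5 each.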

Impact of Loss on the Estimators. We now relax the assumption of finite delays.
We associate infinite delays to packet losses. Although lost packets will not
provide delay samples at receivers, the foregoing clearly still applies to
cumulant estimation based on the end-to-end delays of the received packets. For
any packet pair $(i,j)$ define $I_n(i,j) \subset \{1, \dots, n\}$ the set of pairs
for which both packets reach the leaf nodes; define
$N_n(i,j) = \# I_n(i,j)$ the number of such pairs. Denote by
$B(i,j) = P[X_i(1) < \infty, X_j(2) < \infty]$ the probability that the two packets
of the packet pair reach the leaf nodes. $N_n(i,j)/n$ converges almost surely to
$B(i,j)$ as $n \to \infty$. For large $n$ we have approximately
$N_n(i,j) \approx B(i,j)\, n$ delay measurements from both packets of the pair
$(i,j)$.

We adapt the approach of the foregoing theory by estimating $s_{ij}$ using only
the measurements from the pairs in $I_n(i,j)$. This corresponds to replacing $n$
with $N_n(i,j)$ and $\sum_{m=1}^{n}$ with $\sum_{m \in I_n(i,j)}$ in (14). The
effect of packet loss is to reduce the number of packet pairs available for
estimation, thus increasing the variability of the estimates. The asymptotic
behavior is characterized by results similar to Theorem 1, with $w_{ij}$ replaced
by a correspondingly modified quantity.

5 Network Experiments 

The accuracy of the techniques described in Sections 3 and 4 relies on the
assumptions that: (1) the back-to-back packets in the packet pair experience
roughly the same delay on each link along their common path, i.e.,
$D'_k \approx D_k$; (2) the additional delay experienced by the second packet is
uncorrelated with the delay experienced by the first packet (or practically so),
i.e., $|\mathrm{Cov}[D_k, E_k]| \ll \mathrm{Var}[D_k]$. In this section we
investigate whether measurements of packet pairs transmitted across a number of
end-to-end paths in the Internet conform to both of these assumptions. Although
these experiments did not access the transmission properties of individual links
(which are very difficult to measure), they are able to detect link-wise
departures from the assumptions, since these would also be reflected in the
properties of end-to-end paths over non-conformant links.

Measurement Infrastructure. We conducted the experiments using the National
Internet Measurement Infrastructure (NIMI) [?]. NIMI consists of a number of
measurement platforms deployed across the Internet (primarily in the U.S.) that
can be used to perform end-to-end measurements. We made the measurements
using the zing utility, which sends UDP packets in selectable patterns, recording
the time of transmission and reception. zing was extended to transmit packet
pairs with minimal spacing between packets. The resulting inter-packet spacings
were about 40 μsec.

Measurements were performed along end-to-end paths, by sending packet 
pairs from a sender to a receiver host. These measurements did not allow us to 
directly study the delay behavior of the pair along internal links, which would 
have required measurement inside the network. 

Here we report the results from 13 successful measurements made between
11 NIMI sites (two of which are in Europe). Each measurement recorded at both
sender and receiver the transmission of 6000 back-to-back packet pairs sent at
exponentially distributed intervals with a mean of 100 msec. All measurements
were made at either 2PM EDT (a busy time) or 2AM EDT (a fairly unloaded
time). Since our focus is on the variable portion of the delay, in the results
reported below we normalize each delay measurement by subtracting the minimum
delay seen at the receiver. A delay equal to the minimum delay is thus regarded as
a variable delay equal to 0. In other words, we interpret the observed minimum
delay as the constant propagation and transmission delay along the path, under the
assumption that at least one packet experienced no queueing delay along the path.

Delay Characteristics. In Table 2 we display the relevant delay statistics
(measured in msec) along each path, ordered by increasing average delay.



Table 2. Summary Delay Statistics (in msec).

  E[D_k]   sqrt(Var[D_k])   E[E_k]   sqrt(Var[E_k])   Cov[D_k,E_k]/Var[D_k]
    0.58        0.40         0.06        0.14            -2.73 · 10^-?
    0.62       10.22         0.08        0.22             7.81 · 10^-?
    0.89        2.63         1.03        0.36            -3.16 · 10^-?
    1.27        7.50         0.18        0.88             2.23 · 10^-?
    1.58        8.74         0.10        0.22             1.34 · 10^-?
    1.95        1.62         1.01        0.18            -2.78 · 10^-?
    2.64        3.61         0.08        0.04            -1.81 · 10^-?
    4.30       23.31         0.25        0.34            -1.10 · 10^-?
    5.44       42.70         0.29        0.90             2.80 · 10^-?
    5.62       60.31         0.32        0.45            -5.66 · 10^-?
   37.28       47.55         0.26        0.83             2.79 · 10^-?
   63.72       65.67         0.61        0.82             5.98 · 10^-?
   65.97       62.16         0.42        1.57             5.19 · 10^-?



The average delay ranged from 0.6 msec to 66 msec, a span of two orders of
magnitude. The entries in Table 2 with a large average delay, a large standard
deviation, or both correspond to six experiments involving sites in Europe
(the last rows in Table 2). Despite the delay diversity in our measurements, the
difference in the delay seen by back-to-back packets was somewhat more
uniform, increasing with larger delay, but with an average and standard deviation
typically below 1 msec. This suggests that in practice we can use the
approximation $D_k \approx D'_k$ as long as we adopt a delay resolution larger than
1 msec, i.e., we discretize delay with a bin size larger than 1 msec. At these
resolutions, indeed, the delays seen by the two packets can be considered identical.

Finally, we turn our attention to the bias of the variance estimator. In Table 2
we list the relative bias $\mathrm{Cov}[D_k, E_k] / \mathrm{Var}[D_k]$. The results
show that the variability of $E_k$ is much smaller than that of $D_k$. The bias is
only 3% in the worst case.
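
The statistics reported in Table 2 can be computed from raw packet-pair delays along the following lines. This is only a minimal sketch under stated assumptions: the function name `packet_pair_summary` and the input layout (two equally long lists of one-way delays in msec, one entry per pair) are hypothetical and are not taken from the measurement tools used in the experiments.

```python
from statistics import mean, pvariance


def packet_pair_summary(first_delays, second_delays):
    """Summarize packet-pair delays (inputs in msec, one entry per pair)."""
    if len(first_delays) != len(second_delays) or len(first_delays) < 2:
        raise ValueError("need two equally long lists with at least two pairs")

    # Treat the minimum observed delay as the constant propagation and
    # transmission delay, and keep only the variable part.
    base = min(min(first_delays), min(second_delays))
    d = [x - base for x in first_delays]
    # Additional delay seen by the second packet of each pair.
    e = [y - x for x, y in zip(first_delays, second_delays)]

    mean_d, mean_e = mean(d), mean(e)
    var_d, var_e = pvariance(d, mean_d), pvariance(e, mean_e)
    cov_de = mean([(a - mean_d) * (b - mean_e) for a, b in zip(d, e)])

    return {
        "mean_d": mean_d,
        "std_d": var_d ** 0.5,
        "mean_e": mean_e,
        "std_e": var_e ** 0.5,
        # Relative bias Cov[D, E] / Var[D]; undefined when D is constant.
        "relative_bias": cov_de / var_d if var_d > 0 else float("nan"),
    }
```

A small relative bias together with a small mean and standard deviation of the additional delay is what supports treating the two packets of a pair as seeing the same delay.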

6 Simulation Results 

The experiments of Section 5 show that the delay properties of back-to-back
packets make packet pairs suitable for delay inference. In this section, we employ
simulation to evaluate how accurate the estimators might be in practice.

Fig. 5. Network Topology and Logical Source Tree used in the Simulations.



We used the ns simulation environment [?]; this enables the representation
of transport-protocol details of packet transmissions. The simulations reported
in this paper used the 39-node topology of Figure 5(a). The buffer on each link
accommodated 20 packets. Background traffic came from 420 sessions comprising
a mixture of TCP sessions and exponential and Pareto on-off UDP sources.

We performed different sets of experiments. In each set we fixed a source
and a set of receivers and conducted 100 experiments across the logical tree
spanning those nodes. Measurement probes comprised packet pairs with a 1 μsec
interpacket time. The packet pairs were generated periodically, with a spacing
of 16 msec between successive pairs, by cycling through pairs (i,j) sent to
distinct receivers i,j. In each experiment, for each pair of distinct receivers
$i, j \in R$, n = 1000 packet pairs (i,j) were transmitted.

In order to evaluate the inference methods, we compare inferred delay
statistics, namely, mean and variance, against the actual link delay as determined by
instrumentation of the simulation. Here we will report the results for the logical
7-receiver tree in Figure 5(b), which covers part of the network.



Table 3. Simulations Summary Delay Statistics (in msec).

           E[D_k]   sqrt(Var[D_k])   E[E_k]   sqrt(Var[E_k])   Cov[D_k,E_k]/Var[D_k]
  min.       0.3         1            0.04        0              2.1 · 10^-?
  median    17.4        17.6          0.27        1.1            2 · 10^-3
  max       55          38.7          0.47        2              2.03 · 10^-2



Link Statistical Properties. We first examine the statistical properties of the
underlying link processes. Characteristics vary considerably across the different
links and in the different simulations. The average delay ranged from 0.3 msec to
55 msec, and the delay variance from 1 msec^2 to 1,500 msec^2. The link loss rates
ranged from 0% to 18%. The link delay statistics are displayed in Table 3. The
behavior and range are very similar to those observed in the network experiments.
The important observation is that for 98% of the packet pairs the difference in
the delay seen by back-to-back packets was less than 1 msec. We can thus use
the approximation $D_k \approx D'_k$ as long as the delay resolution is larger than
this value. The bias due to ignoring the term $\mathrm{Cov}[D_k, E_k]$ in the
estimation of the variance is negligible: only about 2% in the worst case.

Accuracy of Inference. We now compare inferred and actual link delay in the
simulations. Here we focus on the estimation of the summary delay statistics,
namely mean and variance. Given the large delay spread across the different links
(delay was as large as a few hundred msec), to infer the average delay we
estimated the link delay distribution using the variable bin size model. We used the
ternary variable bin size model $(3^{i-1}\,\mathrm{msec}, 2)_{i=1,\dots,5}$. For the
analysis, delay was thus discretized to the set
$Q = \{0, 1, 3, 9, 27, 81, 243, \infty\}$ msec, only eight bins.

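The estimate of the average delay is then the conditional mean
$E[D_k \mid D_k < \infty] = \sum_{d \in Q \setminus \{\infty\}} d\,\hat{\alpha}_k(d) \big/ \sum_{d \in Q \setminus \{\infty\}} \hat{\alpha}_k(d)$,
$k \in V$, where $\hat{\alpha}_k(d)$ denotes the estimated probability that $D_k = d$.
As shown below, even if this model can be considered too coarse to
adequately capture the probabilities of large delays, it allowed us to compute the
estimates of the average delay efficiently and accurately. To estimate the link
delay variance, we directly used the variance estimation method described earlier.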
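
The discretization to the ternary bin set and the conditional mean over finite delays can be illustrated as follows. This is a sketch under assumptions that the text does not spell out: delays are rounded down to the largest bin value not exceeding them, a lost probe (passed as `None`) is assigned to the infinite bin, and the names `discretize` and `conditional_mean` are hypothetical.

```python
import math
from bisect import bisect_right

# Finite bin values from the text (msec); math.inf stands for the last bin.
FINITE_BINS = [0, 1, 3, 9, 27, 81, 243]


def discretize(delay_msec):
    """Map a delay to an element of Q; None (a lost probe) maps to infinity."""
    if delay_msec is None:
        return math.inf
    if delay_msec < 0:
        raise ValueError("delay must be non-negative")
    # Assumption: round down to the largest bin value not exceeding the delay.
    return FINITE_BINS[bisect_right(FINITE_BINS, delay_msec) - 1]


def conditional_mean(distribution):
    """Mean delay given that the delay is finite.

    distribution maps each element of Q to its (estimated) probability.
    """
    finite = {d: p for d, p in distribution.items() if d != math.inf}
    mass = sum(finite.values())
    if mass <= 0:
        raise ValueError("no probability mass on finite delays")
    return sum(d * p for d, p in finite.items()) / mass
```

With only eight bins, the geometric spacing keeps fine resolution for small delays while still covering delays of a few hundred msec.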




Fig. 6. Inferred vs. actual average and variance of link delay in simulations. Scatter
plot for 100 experiments: (a) average link delay; (b) link delay variance.



In Figure 6 we display scatter plots of inferred vs. actual link delay mean
and variance. Accuracy increases for higher values, as exhibited by the clustering
about the line y = x. In order to quantify the accuracy of the estimates, we
computed the median of the relative absolute error of the estimates of the link
delay mean and variance. The median was 22.05% for the mean and 40% for the delay
variance. Estimates were more accurate for larger delays: if we consider delay
means larger than 10 msec or delay variances larger than 10 msec^2, the median
of the relative absolute error fell to 10% and 11.75%, respectively.

We can attribute the larger inference errors for smaller delays only in part to
the fact that $D_k \neq D'_k$ and $\mathrm{Cov}[D_k, E_k] \neq 0$. Observe, indeed,
that especially for the variance estimates, the relative errors are quite large
despite $|\mathrm{Cov}[D_k, E_k]| \ll \mathrm{Var}[D_k]$. We ascribe these larger
errors to the departure of the actual packet delays
from the independence assumption of the model. We calculated the coefficient
of correlation of packet delays on consecutive links. The median was 0.09, the
maximum value 0.57. We believe that the higher correlations are a result of the
small scale of the simulated network. In general, we expect correlations to be
smaller in real networks because of the wide traffic and link diversity. The large
effect that correlation has on the estimates of the variance can be explained
by observing that, because of the independence assumption, we ignore all the
cross-correlation terms when we derive the expression for the variance estimator.
In the presence of correlation, these terms are not negligible
and can be significantly larger than the smaller variances. On the other hand,
we observed that estimation of the average delay is more robust and does not
significantly suffer from the violation of the independence assumption. This is
not unexpected since, unlike the variance, the mean of a sum is always equal to
the sum of the means irrespective of the underlying correlation structure.
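
As a worked illustration of this last point, let $X$ and $Y$ stand for the delays on two consecutive links with correlation coefficient $\rho$. The variance values used below are hypothetical; only the correlation 0.57 is taken from the simulations.

```latex
\begin{align*}
E[X+Y] &= E[X] + E[Y], \\
\mathrm{Var}[X+Y] &= \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{Cov}[X,Y]
                   = \mathrm{Var}[X] + \mathrm{Var}[Y]
                     + 2\rho\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}.
\end{align*}
```

With $\rho = 0.57$, $\mathrm{Var}[X] = 1$ msec^2 and $\mathrm{Var}[Y] = 100$ msec^2, the neglected cross term is $2 \cdot 0.57 \cdot 1 \cdot 10 = 11.4$ msec^2, more than ten times the smaller variance, while the mean of the sum is unaffected.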

7 Conclusions 

In this paper, we explored the use of end-to-end unicast traffic measurements
to estimate the delay characteristics of internal network links. Measurement
experiments consist of back-to-back packets (a packet pair) sent from a sender
to pairs of receivers. We developed efficient techniques to estimate the link delay
characteristics, namely, delay distribution and delay variance.

For the estimation of the delay distribution, building on previous work in [?]
and the recent work of the authors of [?], we proposed a novel approach for the
estimation of the link delay distribution. The key idea is the use of a variable bin
size model, wherein smaller bins are used where probability mass is concentrated
and larger bins otherwise. We considered a variable bin size model
the analysis of which can be reduced to that of fixed bin size models. Through
examples, we showed that, compared to previous approaches, we are able to
significantly reduce the computational complexity without losing accuracy.

We also provided methods to directly estimate the link delay variance. We
express the link delay variance in terms of the covariance of the end-to-end
delays. Therefore, we can estimate the variance from the sample covariance of
the end-to-end delays. The method can be extended to the estimation of higher
order cumulants.

Accuracy of the proposed approaches depends on strong correlation between
the delays seen by the two packets along the shared path. We verified the
degree of correlation in packet pairs through network measurements. We also used
simulation to explore the performance of the estimator in practice and observed
good accuracy of the proposed inference techniques, although violation of some
of the model assumptions, e.g., spatial correlation, introduces systematic errors.
This will be the object of further study.

A Computation of

We illustrate the method for the computation of Let Ak(d) = PQ[Xfc(l) = 
d], k G V the probability that the first packet of the pair reaches k in d unit 
of time. For each pair {i,j} G Q{k), we use the approach in m to com- 
pute an estimate {d) of Ak{d) from the empirical distribution of by 
solving a system of polynomial equations. Since Xk{V) = J- Dk and 

and Ufc are independent we obtain an estimate of the distribution of Ufc 
by deconvolution of the estimates of the distributions of Xfc(l) and 
We use this estimate as initial distribution. More precisely, for k G V, we let 
af\d) = {Ak{d)-T,a,^Q^a,<^Af^k){d')a^°\d-d'))/Af^k){0),dG Q\oo, where 
Md) = and let ai°^oo) = 1 - Edee\oo d^k^d)- It is 

possible to show that $\hat{\alpha}^{(0)}$ is a consistent estimator of $\alpha$ and, as $n$ goes to
infinity, $\sqrt{n}\,(\hat{\alpha}^{(0)} - \alpha)$ converges in distribution to a multivariate Gaussian random

variable with mean 0 and covariance matrix a a.

Abstract. Recent research has shown the presence of self-similarity in
TCP traffic which is unaffected by the application level and human fac- 
tors. This suggests the presence of protocol level contributions to network 
traffic self-similarity, at least in certain time scales where the effect of 
protocol behavior is most prominent. In this paper we show how TCP’s 
retransmission and congestion control mechanism contributes to the self- 
similarity of aggregate TCP flows. We develop a mathematical formu- 
lation which shows that TCP’s retransmission and congestion control 
mechanism results in packet dynamics of a TCP flow being analogous 
to a number of ON/OFF sources with OFF periods taken from a heavy 
tailed distribution. Using well known limit theorems, we then show that 
this contributes to the self-similar nature of TCP traffic. Our model 
shows a direct correlation of the loss rates to the degree of self-similarity. 
Measurements on traces collected by us also exhibit this relationship 
predicted by our model. 



1 Introduction 

Research on the causes of self-similarity in network traffic has primarily focused
on the application level characteristics of high-speed networks and the human
factors involved. In [?], the causes of the self-similarity are investigated at the
source level. In [?] the authors cite the distribution of file sizes, the effects of
caching and human factors like response time and preference as possible causes
for the self-similarity in WWW traffic. On the other hand, protocol level causes
of self-similarity in network traffic have been investigated in [2] and [?], which
showed that closed loop protocols like TCP lead to much richer scaling behavior
than open loop protocols like UDP.

In this paper we show that TCP can contribute to the self-similarity of
network traffic and that its contribution is visible in the time scales ranging from
milliseconds to tens of seconds. Thus, though TCP may not be able to contribute at
higher time scales, the observed self-similarity in these scales can be attributed
to application and human level causes which inherently operate at time scales of
minutes and hours. Also, though in the pure mathematical sense a self-similar
process should exhibit the same statistical characteristics over all possible time
scales, this is not possible in real systems due to physical limitations. We show
that TCP is capable of causing scaling in 3 to 4 time scales (a few milliseconds to
tens of seconds) and it is in this sense that we call TCP traffic self-similar. This
range of timescales is generally sufficient for traffic modeling purposes as shown
in [CrBo99], since the range of relevant timescales is determined by the finite
buffer sizes of real systems.

* Supported in part by DARPA contract F30602-00-2-0537 and in part by DoD MURI
contract F49620-97-1-0382.

S. Palazzo (Ed.): IWDC 2001, LNCS 2170, pp. 596 ff., 2001.

(c) Springer-Verlag Berlin Heidelberg 2001

In [?], the authors attribute the self-similarity of TCP traffic to the chaotic
nature of TCP's congestion control mechanism. The adaptive nature of TCP's
congestion control is suggested as the cause for the propagation of self-similarity
in the Internet in [?]. The main aim of our paper is to understand the effects of
TCP's retransmission and congestion control mechanism on the observed
self-similarity of TCP traffic. Our results show that the timeout and exponential
backoff mechanisms in TCP play a crucial role in inducing self-similarity. We also
show that the degree of self-similarity has a direct relationship with the losses
experienced by a flow, with the traffic no longer self-similar, i.e. H ≈ 0.5, for very
low loss rates. While similar phenomena have been reported recently (after this
paper was completed), their models to explain the self-similarity either require
unrealistic loss rates to induce self-similarity [7] or are able to show long-range
dependence only over very small time scales [8]. In this paper, we present a model
of TCP based on ON/OFF processes which explains the self-similarity of TCP
traffic and validate it using TCP traces collected from the Internet. We also
give a mathematical formulation of how TCP's congestion control mechanism
leads to self-similarity in the traffic it generates and account for the effects of
the network in terms of the loss probabilities and the presence of other flows.

The rest of the paper is organized as follows. In Section 2 we first present the
results of tests on traffic traces generated by individual TCP transfers over the
Internet showing evidence of self-similarity. We then present a model which explains
this self-similarity and experimentally validate our model using the same TCP
traces. In Section 3 we provide a mathematical foundation for our model and
investigate the mechanisms of TCP which contribute to self-similarity in greater
detail. Finally, the last section presents the discussions and concluding remarks.



2 Self-Similarity of TCP Flows 

In this section we provide experimental evidence of the self-similarity of
individual TCP flows, which motivates the investigation of TCP dynamics for causes of
self-similarity. In [?] the authors showed that the data sent by an isolated TCP
flow from the superposition of a number of TCP flows shows evidence of
self-similarity and attributed it to the chaotic nature of TCP's congestion avoidance
mechanism. All previous reports of self-similarity in network traffic concentrated
on the self-similar characteristics of the aggregated traffic. However, the results
in [?] were generated by carrying out experiments using the simulator ns, which
is not an exact reflection of the actual scenarios in the Internet. Hence, to
dispel any doubts about the self-similar nature of single TCP microflows, we first
present the results from tests for long-range dependence on traces collected from
real life TCP connections over the Internet.

B. Sikdar and K.S. Vastola

Fig. 1. Tests for self-similarity for the various traces to Columbus, Ohio. For each
trace, the figures show the results from the absolute value method (left), R/S statistics
method (middle) and the Periodogram method (right). Panels: (a) Loss prob = 0.010,
H = 0.70 ± 0.01; (b) Loss prob = 0.078, H = 0.72 ± 0.05; (c) Loss prob = 0.135,
H = 0.76 ± 0.05.

We first give a brief description of the datasets. We collected traces for data
transfers originating from a machine running Solaris 2.6 at RPI, Troy, NY.
The destinations for the transfers were at Ohio State University, Columbus,
OH (HP-UX), University of California, Los Angeles, CA (FreeBSD Cairn-2.5),
Massachusetts Institute of Technology, Boston, MA (Linux 2.0.36) and
University of Pisa, Pisa, Italy (FreeBSD 3.3). Due to space restrictions, we show results
for only the transfers to Ohio and Italy. The results for the others are similar.

Fig. 2. Tests for self-similarity for the various traces to Pisa, Italy. For each trace, the
figures show the results from the absolute value method (left), R/S statistics method
(middle) and the Periodogram method (right). Panels: (a) Loss prob = 0.001,
H = 0.51 ± 0.01; (b) Loss prob = 0.006, H = 0.67 ± 0.03; (c) Loss prob = 0.099,
H = 0.73 ± 0.03.



Each trace is 2000 seconds or around 33 minutes long and was collected using
tcpdump, which did not lose any packets. The transfers were done over periods in
1999 and 2000 at various times of the day and week. Depending on the prevalent
network conditions, the loss rate experienced by each flow is different and we
use this to classify transfers between a source-destination pair.

Fig. 3. Tests for heavy-tailed nature of the OFF times for the various traces to
Columbus, Ohio. For each trace, the figures show the ccdf plots (top) and the corresponding
Hill's estimates (bottom) for various values of w. Panels: (a) p = 0.010;
(b) p = 0.078; (c) p = 0.135.

Figure 1 shows the results of the tests for long-range dependence on three
traces to Ohio which had loss rates of 0.010, 0.078 and 0.135. Figure 2 shows
the results of similar tests on the traces collected from transfers to Pisa which
had loss rates of 0.001, 0.006 and 0.099. We tested for long-range dependence
using three of the widely used methods [?]: the absolute value method, the R/S
statistics method and the periodogram method. The results clearly show the
long-range dependence in the individual TCP flows. Also, the degree of long-range
dependence, as indicated by the Hurst parameter, is clearly dependent on the
loss rate experienced by the flow, with higher loss rates leading to larger values
of H. Also note that for extremely low loss probabilities (less than 0.001) the traffic
is no longer self-similar, as indicated by the Hurst parameter of approximately
0.5 shown in panel (a) of Fig. 2. We describe this in detail in the following
subsection and in Section 3.
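
To make the first of these tests concrete, the absolute value method can be sketched as follows: the series is averaged over blocks of size m, the first absolute moment of the block means about the overall mean is computed, and the slope of its log-log plot against m is H - 1. The code is only an illustration under assumptions; the function name `hurst_absolute_value` and the choice of block sizes are hypothetical and are not those used to produce Figs. 1 and 2.

```python
import math


def hurst_absolute_value(series, block_sizes):
    """Estimate the Hurst parameter with the absolute value method."""
    n = len(series)
    overall = sum(series) / n
    xs, ys = [], []
    for m in block_sizes:
        k = n // m  # number of complete blocks of size m
        if k < 2:
            continue
        block_means = [sum(series[i * m:(i + 1) * m]) / m for i in range(k)]
        # First absolute moment of the aggregated series about the mean.
        moment = sum(abs(b - overall) for b in block_means) / k
        if moment > 0:
            xs.append(math.log(m))
            ys.append(math.log(moment))
    if len(xs) < 2:
        raise ValueError("need at least two usable block sizes")
    # Least-squares slope of log(moment) against log(m).
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope + 1.0
```

For a series without long-range dependence the block means shrink like m^(-1/2), so the estimate is close to 0.5; values between 0.5 and 1 indicate long-range dependence.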

This poses the following questions. What are the underlying mechanisms
which are responsible for the direct influence of the loss probabilities on the
self-similarity of TCP traffic? What role do TCP's fast-retransmit and timeout
mechanisms play in all this? In this paper we address these issues and show how
TCP's retransmission and congestion avoidance mechanisms contribute to the
self-similar nature of network traffic.



2.1 ON/OFF Model Based Explanation and Its Validation 

TCP follows a window based flow control mechanism and transmits a certain
number of packets in each "round". We define a round as in [?]. A round begins
with the back to back transmission of a window of packets. After these packets
are transmitted, no other packet is transmitted till an ACK is received for one
of these packets. The receipt of an ACK marks the end of the round.

Fig. 4. Tests for heavy-tailed nature of the OFF times for the various traces to Pisa,
Italy. For each trace, the figures show the ccdf plots (top) and the corresponding Hill's
estimates (bottom) for various values of w.



To give an explanation for TCP's effect on the self-similarity of network
traffic, we consider a TCP flow to be composed of the superposition of $W_{max}$
ON/OFF processes. Each process corresponds to one of the possible values that
the cwnd of the flow might have, since $W_{max}$ is the receiver's advertised maximum
buffer size and is the upper limit on cwnd. A cwnd of $w$, corresponding to the
$w$-th ON/OFF process, $1 \le w \le W_{max}$, implies a deterministic ON time which
is equal to the time to transmit the $w$ packets, with the packets generated at a
constant rate during this period. We note that though in practice there might
be a small variation in the time between two successive packets in a round,
these are generally very small and with high speed networks these variations are
negligible when compared to the RTTs. Also, as described below,
the way we demarcate the end of ON periods ensures that the spacing between
the packets in the ON period is almost constant.

The OFF period for the $w$-th process, $1 \le w \le W_{max}$, corresponds to the time
interval between two successive instants where cwnd has the value $w$. Now, if
the distribution of these times has a heavy tail, their complementary cumulative
distribution function (ccdf) $F_c(x)$ behaves like

$F_c(x) \sim l\,x^{-\alpha} L(x)$ with $1 < \alpha < 2$ (1)

where $l > 0$ is a constant, $L(x)$ is a slowly varying function at infinity, i.e.,
$\lim_{x \to \infty} L(tx)/L(x) = 1$ for all $t > 0$, and the relation $f(x) \sim g(x)$
means $\lim_{x \to \infty} f(x)/g(x) = 1$. We can now use the following theorem from [?],
which says that the superposition of a number of these processes converges in the
limit to fractional Brownian motion (fBm) and thus exhibits self-similarity.

Consider $M$ independent ON/OFF processes. Let $F_{1c}^{(r)}$, $\mu_1^{(r)}$ and
$\sigma_1^{2(r)}$ ($F_{2c}^{(r)}$, $\mu_2^{(r)}$ and $\sigma_2^{2(r)}$) denote the ccdf,
the mean duration and the variance of the ON (OFF) period of the ON/OFF process
of type $r$. Now, if $W_M(Tt)$ represents the
aggregated packet count in the interval $[0, Tt]$ due to the contribution from all
the $M$ sources then

Theorem 1. (Taqqu, Willinger and Sherman) As $M^{(r)} \to \infty$, $r = 1, \dots, R$,
and $T \to \infty$, the aggregated cumulative packet traffic $\{W_M(Tt), t \ge 0\}$ behaves
statistically like

$T t \sum_{r=1}^{R} \frac{M^{(r)} \mu_1^{(r)}}{\mu_1^{(r)} + \mu_2^{(r)}} + \sum_{r=1}^{R} T^{H^{(r)}} \sqrt{L^{(r)}(T)\, M^{(r)}}\; \sigma_{lim}^{(r)}\, B_{H^{(r)}}(t)$

where the $B_{H^{(r)}}(t)$ are independent fractional Brownian motions and $H^{(r)}$ and
$\sigma_{lim}^{(r)}$ are as defined in [?].

In our case, $R$ corresponds to the maximum window size allowed for any
of the flows in the network and the limiting conditions are reached when we
have a large number of flows in the network each contributing its ON/OFF
processes to the superposition. Now we just need to show that the distribution
of the OFF times indeed corresponds to the form of Eqn. 1. In Figs. 3 and 4
we plot the ccdf of the OFF times for various window sizes for the traces for
Ohio and Italy and the heavy tailed nature of each is clearly evident. While
the ccdf plots often provide solid evidence for or against the presence of heavy
tails, an eyeballing method is statistically unsatisfactory and the rough estimates
of $\alpha$ obtained from these plots may be unreliable. A statistically more rigorous
method for estimating the slope of the tails, and thus $\alpha$, is Hill's estimator [?].
The presence of heavy tails is indicated by a straight line behavior of the Hill
estimate $\hat{\alpha}_n$ as the number of samples used in the calculation of the estimate
increases, while a steadily decreasing pattern is a strong indication that the data
are not from a heavy-tailed distribution. Figs. 3 and 4 also plot the Hill
estimates for the OFF time distribution for various window sizes for the Ohio
and Italy traces respectively, and clearly they are consistent with the form of
Eqn. 1. Thus we can conclude that the superposition of such ON/OFF processes
from a number of TCP flows will converge in the limit to fBm and thus exhibit
self-similarity.
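
For reference, the Hill estimate based on the k largest samples is the reciprocal of the mean log-excess of those samples over the (k+1)-th largest one. The following is a minimal sketch of that standard estimator; the function name `hill_estimates` and the way the values of k are supplied are hypothetical and do not describe the tooling behind Figs. 3 and 4.

```python
import math


def hill_estimates(samples, k_values):
    """Return {k: Hill estimate of the tail index} for each usable k."""
    # Order statistics in decreasing order; only positive samples are usable.
    ordered = sorted((s for s in samples if s > 0), reverse=True)
    logs = [math.log(v) for v in ordered]
    estimates = {}
    for k in k_values:
        if not 1 <= k < len(logs):
            continue
        # Mean log-excess of the k largest samples over the (k+1)-th largest.
        mean_excess = sum(logs[:k]) / k - logs[k]
        if mean_excess > 0:
            estimates[k] = 1.0 / mean_excess
    return estimates
```

Plotting the returned values against k gives the kind of diagnostic described above: an approximately flat curve suggests a heavy tail with that index, while a steadily decreasing curve argues against one.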

It is interesting to note the ccdf and Hill estimate plots for the Italy trace
with p = 0.001. From Figure 2 the Hurst parameter for this trace can be seen to
be around 0.5, i.e. the trace does not exhibit self-similarity. We note from Figure
4 that the Hill estimates for all the ON/OFF processes corresponding to this trace
are steadily decreasing and thus do not have a heavy tailed nature. Thus the
ON/OFF processes corresponding to this trace do not satisfy the conditions of
Theorem 1 and as a result the trace is not self-similar. In Section 3.3, from our
derivation of a lower bound of the ccdf, it will be clear why low loss rates fail to
give rise to heavy tails.

An important assumption here is the independence of the window sizes of
different flows, which need not be the case for all the flows in a link. Simulation
studies have indicated that the window sizes of TCP flows sharing a common
bottleneck link may get synchronized, though such synchronization is hard to
observe in the Internet [?]. Also, most of the simulation studies focus on very
heavily congested bottleneck links while link loads in practice tend to be
comparatively much lower. Also, note that the independence requirements fail to be
satisfied only when nearly all the flows in a link are correlated. To check whether
the independence assumptions of Theorem 2 of [?] are satisfied, we analyzed
some of the traces reported in [?]. The results of our statistical tests on these
traces to see if the individual TCP flows are indeed independent indicate that
amongst the longer flows in the traces, roughly 35-70% of the flows are mutually
independent, providing enough independent flows in the superposition.

An important part in the calculation of the OFF times is the criterion
we use to define an OFF period. We define an ON period to be over whenever
the distance between two successive packets in the trace exceeds a length $\delta$
dependent on the packet transmission time on the link. By keeping $\delta$ sufficiently
small we can ensure that the spacing between the packets in the ON period is
almost constant, thus satisfying the requirement of Theorem 1. Also, as in [?],
the exact numerical choice of $\delta$ does not affect the results and the heavy tailed
nature of the ccdf remains invariant to the choice of $\delta$.
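
The demarcation rule can be sketched in code as follows. The sketch makes assumptions beyond the text: it works on a sorted list of packet send times, it treats the number of packets in each ON period (burst) as the window size w of that round, and the name `off_times_by_window` is hypothetical.

```python
from collections import defaultdict


def off_times_by_window(timestamps, delta):
    """Return {w: [OFF durations]} from sorted packet send times.

    An ON period ends whenever the gap to the next packet exceeds delta.
    Assumption: the packet count of an ON period stands in for cwnd.
    """
    if not timestamps:
        return {}
    bursts = []  # (start time, end time, packet count) of each ON period
    start = prev = timestamps[0]
    count = 1
    for t in timestamps[1:]:
        if t - prev > delta:
            bursts.append((start, prev, count))
            start, count = t, 0
        count += 1
        prev = t
    bursts.append((start, prev, count))

    last_end = {}
    off = defaultdict(list)
    for begin, end, w in bursts:
        # OFF time of process w: from the end of its previous ON period
        # to the start of its next one.
        if w in last_end:
            off[w].append(begin - last_end[w])
        last_end[w] = end
    return dict(off)
```

The per-window lists returned here are the samples whose ccdf and Hill estimates would then be examined for heavy tails.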

3 Investigating the Role of TCP 

Having presented a model explaining the self-similarity of TCP traffic, we now
pinpoint the sources in TCP's retransmission and congestion avoidance
mechanism which are responsible for this phenomenon. We then derive a lower bound
on the tail of the OFF time distribution and show that it decays according to a
power law, providing a firm mathematical foundation for our model. In this paper
we concentrate on TCP Reno as it is the most widely deployed variant of TCP.
The effect of the other versions of TCP is discussed later in the paper. We assume
that the reader is familiar with the basic concepts of TCP like the congestion
window cwnd, slow start, delayed acknowledgments etc. and refer the reader to
[?] for details on TCP's algorithms.

3.1 The Impact of Timeouts 

From the explanation for the observed self-similarity in TCP traffic given in
Section 2 it is obvious that the central aspect of the phenomenon lies in the
infinite variance or the heavy tailed nature of the OFF time distributions. Let
us now consider the features of TCP which lead to such a behavior.

In the following we assume an infinite or steady state flow currently in the
congestion avoidance mode to make the visualization easier. Consider a TCP
flow with a current window size of $w$, $w < W_{max}$. In every round that follows,
the window now increases linearly until it reaches $W_{max}$, and we need a loss for
the window to drop back so that we get a window of size $w$ again. Note that
if the window reaches a value greater than $2w$ before a loss indication and it
results in a fast retransmit, the subsequent congestion avoidance mode will start
with a window greater than $w$, leading to even longer times before a window of
$w$ is reached. However, the occurrence of heavy tails is mainly due to the loss
indications which lead to timeouts. This is due to the following reasons. A
timeout represents a significant duration when no packets are transmitted and acts
as a boundary between ON and OFF periods of the flow as a whole, leading to a
bursty nature of TCP traffic. The durations of timeouts are generally an order
of magnitude greater than the RTT [?] and, with coarse TCP timer granularities
and variations in the RTT measurements, can be quite large. Again, if the
retransmitted packet following a timeout is also lost, the silent period is doubled,
and from the traces reported in [?] the occurrence of multiple consecutive
timeouts is frequent. Also, a majority of the losses experienced by TCP flows lead
to timeouts, which can be attributed to the fact that most routers in the
Internet deploy droptail queues. Correlated loss models, where all the packets
following the first dropped packet in a round are also dropped, are appropriate
models for the losses arising from these queues [?]. This, coupled with the fact
that a single loss in a window less than 4, two or more losses in a window less
than 8 and three or more losses for higher windows in TCP Reno will lead to
a timeout, contributes to the large proportion of timeouts in the observed loss
indications. Before moving on to the derivation of the lower bound on the tail of
the ccdf, we first derive the probability that a loss in a window of size $w$ leads
to a timeout.



3.2 Probability of Timeouts 



Consider a round with window $w$ and let the probability that a loss of any packet
in this round will lead to a timeout be denoted by $Q(w)$. We assume that the
receiver sends one ACK for every two packets it receives. We assume that all
losses are due to packet drops at intermediate queues and that losses due to data
corruption are negligible. We also assume droptail queues and the correlated loss
model of the previous subsection. Packet losses in a round are assumed to be
independent of losses in other rounds and the packet loss probability is denoted
by $p$.

For window sizes less than 4, any packet loss leads to a timeout and thus
Q(w) = 1 for 1 ≤ w ≤ 3. For windows with 4 ≤ w ≤ 8 (or K + 1 to 2(K + 1)),
two or more packet losses in a round lead to a timeout. If only one packet
is lost in the current round, the flow will still eventually time out if any
packet is lost in the following round. In addition, the retransmitted packet must
also be transmitted successfully to avoid a timeout. Thus the probability that a
packet loss does not lead to a timeout for this range of window values is given by

$$1 - Q(w) = \frac{p(1-p)^{w-1}}{1-(1-p)^{w}}\,(1-p)^{w-1}\,(1-p) \qquad (2)$$

The first term corresponds to the probability of exactly one packet loss in a
window of w, given that the window has at least one loss. The last two terms
correspond to the probability that all the w - 1 packets in the following round
and the retransmitted packet are received correctly. Thus

$$Q(w) = 1 - \frac{p(1-p)^{2w-1}}{1-(1-p)^{w}} \quad \text{for } 4 \le w \le 8 \qquad (3)$$
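As a rough numerical check of the two regimes above, the following Python sketch evaluates Q(w) for windows up to 8 under the correlated loss model. The function name `timeout_probability` is illustrative and does not come from the original analysis; larger windows are deliberately left out of the sketch.

```python
def timeout_probability(w: int, p: float) -> float:
    """Q(w) for 1 <= w <= 8, following the two regimes described above."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if 1 <= w <= 3:
        # Any loss in such a small window ends in a timeout.
        return 1.0
    if 4 <= w <= 8:
        # Exactly one loss, given at least one loss in the window.
        exactly_one_loss = p * (1 - p) ** (w - 1) / (1 - (1 - p) ** w)
        # All w - 1 packets of the following round arrive.
        next_round_ok = (1 - p) ** (w - 1)
        # The retransmitted packet arrives.
        retransmission_ok = 1 - p
        return 1.0 - exactly_one_loss * next_round_ok * retransmission_ok
    raise ValueError("this sketch only covers 1 <= w <= 8")
```

Even at moderate loss rates the value stays close to 1 for these small windows, which is in line with the large share of timeouts reported for the traces.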

For window sizes greater than 8, three or more losses in a round will lead to 
a timeout. Also we have to ensure that the retransmitted packet is received 
successfully along with the fact that none of the packets in the succeeding round 
are lost. Neglecting the extremely few possibilities in which it is possible to 
recover a single loss in the succeeding round without going into a timeout, we 
thus have 



$$Q(w) = 1 - \frac{p(2-p)(1-p)^{\,\ldots}}{1-(1-p)^{w}} \quad \text{for } 9 \le w \le W_{max}$$



3.3 A Lower Bound on the OFF Time Distribution 

We now derive a lower bound on the ccdf by identifying the possible ways in 
which the time between two successive windows of the same size can exceed a 
given value. We concentrate on the most likely paths that the cwnd is likely to 
follow while not accounting for the others as their contribution to the ccdf is 
negligible. In this derivation, we measure time in units of the round trip time. 

Let us assume that the current window size is w and we want to find the
probability that the time until the next instant where cwnd = w is greater than
100. The most obvious possibility is that the flow does not experience any loss for
the next 100 rounds, so that after some round the cwnd stays at W_max. However,
with higher loss probabilities this event is unlikely, and the probability tail based
on just this mechanism has an exponential decay. Another possibility is
that after i rounds (when cwnd > 2w) the flow experiences a loss which results
in a fast retransmit. The flow then transmits the next 100 - i rounds without any
loss. As a variation of this, we could have a number of successive fast retransmits
without reaching a window of w. Note that these possibilities are mutually
independent and that the individual contribution of each to the tail of the
distribution has an exponential decay, each with its own rate. Yet another line
of possibilities involves timeouts. Let us denote the average duration of a
timeout (in terms of RTTs) by E[TO]. As the first possibility, there could be no
losses in the first 100 - E[TO] rounds, followed by a timeout. We could also have
i initial rounds without loss and then n timeouts (with n sufficiently large)
before the window gets a chance to increase to w. Other possibilities include
cases where we have timeout periods of length 2E[TO], 4E[TO] and so on. Again,
each of these cases represents an independent possibility whose individual
contribution to the tail of the OFF time distribution has an exponential decay,
the rate of which depends on the corresponding probability of the loss
indications and their effects.

The tail of the OFF time distribution for each window size and the
corresponding ON/OFF process can thus be seen as the superposition of a large
number of independent exponential tails, each with its own rate of decay. The mix
of these independent exponentials leads to a composite distribution which has
a heavy tail over the region of our interest. The following theorem by
Bernstein [?] provides the link between the mixture of exponentials and a
completely monotone probability density function (pdf).

Theorem 2. (Bernstein) Every completely monotone pdf f is a mixture of
exponential pdfs, i.e.,

$$f(t) = \int_0^\infty \lambda e^{-\lambda t}\, dG(\lambda), \quad t \ge 0 \qquad (4)$$

for some proper cdf G.

It can be shown that the commonly used heavy-tailed distributions like Pareto
and Weibull are completely monotone. Also, in [?] it is shown that the
superposition of a number of properly chosen exponentials can be used to model
heavy-tailed distributions in the region of primary interest. Having shown the
basic construction of how the mix of exponentials leads to heavy tails in the OFF
time distribution, we now obtain the probabilities corresponding to each of the
possible paths that we described.
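To make the mixture argument concrete, the short Python sketch below compares the tail of a mixture of three exponentials with the tail of a single exponential that has the same mean. The weights and rates are hypothetical values chosen only for illustration; they are not fitted to the traces discussed here.

```python
import math


def mixture_ccdf(t, weights, rates):
    """P{T > t} for a mixture of exponentials with the given weights and rates."""
    return sum(wk * math.exp(-rk * t) for wk, rk in zip(weights, rates))


# Hypothetical mixture: most of the mass decays fast, a little decays very slowly.
weights = [0.90, 0.09, 0.01]
rates = [1.0, 0.1, 0.01]
mean = sum(wk / rk for wk, rk in zip(weights, rates))

for t in (1, 10, 100, 1000):
    single = math.exp(-t / mean)  # one exponential with the same mean
    print(t, mixture_ccdf(t, weights, rates), single)
```

For large t the slowest component dominates the mixture, so over a wide range of t its tail decays far more slowly than that of the single exponential.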

Case 1: The no loss case. Let us begin with the simplest case where there
are no losses. Consider the ON/OFF process which corresponds to a cwnd
of w, 1 < w < W_max, excluding the special cases with cwnds of 1 and W_max.
Assume that the current round has a window of size w. The probability that the
next window of size w occurs after t units of time (i.e. t RTTs), assuming there
are no losses in between, is given by

$$P\{T > t\} = (1-p)^{N(t)} \qquad (5)$$

where N(i) represents the number of packets that are transmitted in the i
rounds following the round with size w and is given by

$$N(i) = \begin{cases} iw + \lceil i/2 \rceil \,(i - \lceil i/2 \rceil) & \text{if } i \le j \\ jw + \lceil j/2 \rceil \,(j - \lceil j/2 \rceil) + (i - j)\,W_{max} & \text{else} \end{cases} \qquad (6)$$

where j = 2(W_max - w) - 1 represents the time it takes for the cwnd to
reach W_max, assuming no losses.
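The no-loss case can be sketched in a few lines of Python. The function names are illustrative, and the packet count assumes the reading of Eqn. (6) in which the window grows by one packet every two rounds until it saturates at W_max.

```python
def packets_sent(i: int, w: int, w_max: int) -> int:
    """N(i): packets sent in the i loss-free rounds that follow a window of w."""
    j = 2 * (w_max - w) - 1  # rounds before cwnd saturates at w_max

    def ramp(r: int) -> int:
        half = (r + 1) // 2  # ceil(r / 2)
        return r * w + half * (r - half)

    if i <= j:
        return ramp(i)
    return ramp(j) + (i - j) * w_max


def no_loss_tail(t: int, w: int, w_max: int, p: float) -> float:
    """P{T > t} when none of the N(t) packets is lost (each survives w.p. 1 - p)."""
    return (1 - p) ** packets_sent(t, w, w_max)
```

A simple cross-check is to sum the per-round windows min(W_max, w + floor(k/2)) for k = 1, ..., i, which gives the same count as the closed form.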

Case 2: Fast retransmission losses. We now consider the more likely
cases where a flow experiences n losses between two successive windows of the
same size which are far apart in time. Consider again the ON/OFF process for
1 < w < W_max. We can have an OFF time greater than t if we have loss indications
at windows greater than 2w which result in fast retransmits. For simplicity, we
consider only those cases where the loss occurs in a window of size W_max. The
flow first transmits packets without loss for the first i rounds, during which its
window reaches W_max. It then experiences a loss which is recovered by a fast
retransmit. Since w < ⌊W_max/2⌋, the desired window size is not achieved at the






beginning of the congestion avoidance mode. Also, following each loss there are
2(W_max - m) - 1 rounds with W_max(W_max - 1) - m(m + 1) packets until cwnd
reaches W_max again, with m = ⌊W_max/2⌋. Thus there are
t - n - n(2(W_max - m) - 1) - 2(W_max - w) + 1 rounds with successfully
transmitted windows of W_max. The total number of correctly transmitted packets,
after algebraic simplifications, is thus



$$N_c(w,t) = W_{max}\bigl(t - (n+1)W_{max} + 2w + 2nm - 4n\bigr) - w(w+1) - nm(m-1) \qquad (7)$$

Now, since there are M = t - 2nW_max + 2w + 2(n - 1)m - 2n + 3 rounds with a
cwnd of W_max, with n of them having losses, the probability that the OFF time
is greater than t is given by

$$P\{T > t\} = \binom{M}{n}\bigl(1-(1-p)^{W_{max}}\bigr)^{n}\bigl(1-Q(W_{max})\bigr)^{n}(1-p)^{N_c(w,t)} \qquad (8)$$

Also, since each loss is associated with 2(W_max - m) - 1 rounds where the window
is not W_max, the maximum possible number of losses in t rounds can be shown to
be limited by

$$n_{max} = \frac{t - 2(W_{max} - w) + 1}{2W_{max} - 2m - 1} \qquad (9)$$

Case 3: Loss indication resulting in a timeout. Let us now consider
the case when the TCP flow experiences a single loss indication which results in
a timeout. Consider the case when the loss occurs i rounds after the round
with a window of w. The number of packets transmitted in these i rounds, N(i),
is given in Eqn. (6), and the value of the cwnd in that round, W_i, is given by

$$W_i = \min\{W_{max},\; w + \lfloor i/2 \rfloor\} \qquad (10)$$



To find the number of packets transmitted in the slow start phase which follows a
timeout, we use the model of [?], which models the window increase pattern more
accurately than the commonly used approximation where the window always
increases 1.5 times every RTT. From [?], the number of rounds spent in the
slow start phase is given by

$$t_{ss}(W_i) = 2\log_2\!\left(\frac{2m}{1+\sqrt{2}}\right) - 1 \qquad (11)$$



where m = ⌈W_i/2⌉ and the number of packets transmitted in the slow start phase,
N_ss(W_i), can be expressed in terms of t_ss(W_i) as

[Equation (12): closed-form expression for N_ss(W_i); not recoverable from the source text.]






If w > m we also have a linear phase where the window increases linearly from m
to w. The total time required by the flow to reach a window of w again following
the timeout is thus

$$D_{ni}(w, W_i) = \begin{cases} t_{ss}(w) + E[TO] + 1 & \text{if } w \le m \\ t_{ss}(W_i) + E[TO] + 2(w - m) & \text{else} \end{cases} \qquad (13)$$



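Now, the probability that we have a loss in a round of size u following the
timeout, before the window reaches w, P_TO(u, W_i), 1 ≤ u < w, is given by

$$P_{TO}(u, W_i) = \begin{cases} Q(u)\bigl(1-(1-p)^{u}\bigr) & \text{if } u \le m \\ Q(u)\bigl(1-(1-p)^{2u}\bigr) & \text{else} \end{cases} \qquad (14)$$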



Note that the 1 - (1 - p)^{2u} term in the second case has an exponent 2u because
in the linear phase we have two consecutive rounds with the same window size.
Then, the probability that there is another timeout before the window reaches
w is given by

$$P_s(w, W_i) = \sum_{u=1}^{w-1} P_{TO}(u, W_i) \qquad (15)$$



Note that in the summation above some of the values of P_TO(u, W_i) are zero if
u < m and cwnd skips these values of u due to the exponential increase pattern.
After the round in which the loss occurs, on average 2 more rounds of packets are
sent (where the first couple of losses may be recovered) before the timeout period
begins. Thus if i > t - D_ni(w, W_i) - E[TO] - 2, the probability that the OFF
time is greater than t is given by

$$P\{T > t\} = \begin{cases} (1-p)^{N(i)}\bigl(1-(1-p)^{W_i}\bigr)Q(W_i) & \text{if } i > I_i \\ (1-p)^{N(i)}\bigl(1-(1-p)^{W_i}\bigr)Q(W_i)\bigl(1-P_s(I_i - i)\bigr) & \text{else} \end{cases} \qquad (16)$$

where I_i = t - E[TO] - 2. The factor (1 - P_s(I_i - i)) in the second case gives
the probability that we do not have another loss before the window reaches w.
It is absent in the first case since i + E[TO] + 2 > t and we do not have to
consider whether the packets following the timeout period are transmitted
correctly or not.

Case 4: When the retransmitted packet is lost. When the first retransmitted
packet following a timeout is also lost, the retransmission timer backs off
exponentially with a factor of 2 and can thus lead to very large silent periods.
The duration of a sequence of n consecutive losses, in multiples of E[TO], is
given by

$$L_n = \begin{cases} 2^{n} - 1 & \text{for } n \le 6 \\ 63 + 64(n - 6) & \text{else} \end{cases} \qquad (17)$$
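The backoff rule of Eqn. (17) is easy to tabulate. The Python sketch below is only an illustration, and `backoff_length` is a made-up name. The second branch reflects the timer no longer doubling once it has reached 64 times its base value.

```python
def backoff_length(n: int) -> int:
    """L_n: total silent time, in units of E[TO], for n consecutive losses."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n <= 6:
        # Timer doubles each time: 1 + 2 + ... + 2**(n - 1).
        return 2 ** n - 1
    # After six doublings every further loss adds 64 timeouts.
    return 63 + 64 * (n - 6)
```

For example, three consecutive losses keep the flow silent for 7 E[TO], while seven consecutive losses give 127 E[TO].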



Each of the losses following the initial loss indication occurs with probability p.
Also, the linear phase of the cwnd following the second loss begins after cwnd
reaches 2. Now consider the case when the flow experiences n loss indications,
n - 1 of them being losses of retransmitted packets, and that the first loss
occurred after i rounds. Then, if i > t - L_n E[TO] - 2(w - 2) - 1, the
probability that the OFF time for window w is greater than t is given by

$$P\{T > t\} = \begin{cases} (1-p)^{N(i)}\bigl(1-(1-p)^{W_i}\bigr)Q(W_i)\,p^{n-1} & \text{if } i > I_i \\ (1-p)^{N(i)}\bigl(1-(1-p)^{W_i}\bigr)Q(W_i)\,p^{n-1}\bigl(1-P_s(I_i - i)\bigr) & \text{else} \end{cases} \qquad (18)$$

where I_i = t - L_n E[TO] - 2. The presence of (1 - P_s(I_i - i)) in the second
case can be explained as before.

Case 5: n isolated timeouts. Let us now consider the case where there
are n isolated timeouts, each of length E[TO]. After the first loss after i rounds,
the slow start phase lasts until cwnd reaches m = ⌈W_i/2⌉. The second loss occurs
before cwnd reaches a value of w. The expected duration between the first and
the second loss indications, D_i(W_i), is given by

[Equation (19): piecewise expression for D_i(W_i), with one case for w ≤ m and one for w > m. Each case consists of E[TO] + 2 plus sums over the window size u weighted by P_TO(u); the summation limits and the linear-phase summand are not recoverable from the source text.]



In the above expression, the second summation in the second case corresponds to
the linear increase phase, where we have two consecutive windows with the same
size. After the initial loss indication, each of the succeeding loss indications can
occur at a window between 1 and w - 1. For each of these, we model the average
duration between two successive losses by D_i(w). Also, the probability that there
is another loss following the first loss (before the window reaches w) leading to a
timeout is given by P_s(w, W_i). Correspondingly, we model the same probability
for all losses after the second loss by P_s(w, w). Also, after the last loss, it takes
t_ss(w - 1) + 2(w - m) - 1 rounds for the window to reach a size of w. Since
t - D_i(W_i) - i rounds comprise the duration for the rest of the losses following
the first loss indication, we need at least

$$\left\lceil \frac{t - D_i(W_i) - i}{D_i(w)} \right\rceil + 1 \qquad (20)$$



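losses for the OFF time to exceed t. Then if n > 1 (the case n = 1 has already
been considered) the probability that the OFF time is greater than t is given by

$$P\{T > t\} = \begin{cases} (1-p)^{N(i)}\bigl(1-(1-p)^{W_i}\bigr)Q(W_i)\,P_s(w,W_i)\bigl(P_s(w,w)\bigr)^{n-2} & \text{if } i > I_i \\ (1-p)^{N(i)}\bigl(1-(1-p)^{W_i}\bigr)Q(W_i)\,P_s(w,W_i)\bigl(P_s(w,w)\bigr)^{n-2}\bigl(1-P_s(I_i - i)\bigr) & \text{else} \end{cases} \qquad (21)$$

where I_i = t - D_i(W_i) - (n - 2)D_i(w) - E[TO] - 2.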

Case 6: Multiple consecutive losses. We now consider the cases where
there are n losses which are successfully recovered using a single timeout and l
losses in which the retransmitted packet is also lost, resulting in silent periods
which are multiples of E[TO]. Let the l periods of consecutive timeouts all be due
to j consecutive losses. The probability of each of these n periods is
P_s(w, w)p^{j-1}, and the probability of the single loss indications is P_s(w, W_i)
and P_s(w, w) for the first and the rest of the n - 1 losses respectively. For a
given n and l we can have a sequence of n + l losses in t rounds only if
t - D_i(W_i) - (n + l - 1)D_i(w) - l(2^j - 2)E[TO] < i <
t - D_i(W_i) - (n + l - 2)D_i(w) - (l - 1)(2^j - 2)E[TO].
For the values of i falling in this range, the probability that the OFF time is
greater than t is given by

[Equation (22): two-case expression for P{T > t}. Both cases contain the factors (1 - (1 - p)^{W_i}) Q(W_i) P_s(w, W_i), and the case i ≤ I_i additionally contains a power of P_s(w, w) and the factor (1 - P_s(I_i - i)); the remaining factors and exponents are not recoverable from the source text.]

where I_i = t - D_i(W_i) - (n + l - 2)D_i(w) - l(2^j - 2)E[TO] - E[TO] - 2.

[Figure 5 panels. Ohio: (a) p = 0.010, (b) p = 0.078, (c) p = 0.135. Pisa: (a) p = 0.001, (b) p = 0.006, (c) p = 0.099.]

Fig. 5. The lower bound on the tails of the ccdf for the Ohio and Italy traces for various
values of w. The time t is in seconds.



3.4 Numerical Results 

We now present the numerical evaluation of the lower bounds for the parameters
from all the Ohio and the Pisa traces considered earlier. In Fig. 5 we show
the ccdf for the various window sizes for both destinations. The heavy-tailed
nature of the tails is evident and, as expected, the rate of decay reduces with
increasing loss probabilities. Also, to see the impact of timeouts on the tails of
the ccdf, in Table 1 we show the contribution to the tails by the various cases
that we considered in the previous subsection. As expected, timeouts make a
large contribution to the tails, especially at higher loss probabilities. For very
low loss rates, the contribution due to multiple losses is negligible and the tail
is made of just 3-4 exponentials. For higher losses, the probability of multiple
timeouts increases and we have a large number of exponentials with different
rates, the superposition of which leads to a heavy-tailed distribution.

Table 1. Contribution of various losses to the ccdf. t = 200 RTTs, w = 10, W_max = 18.

Type of Loss    p = 0.100              p = 0.001
                prob       ccdf        prob       ccdf
Case 1          0.0000     0.0000      0.0304     0.0304
Case 2          0.0000     0.0000      0.0000     0.0304
Case 3          0.0000     0.0000      0.0123     0.0428
Case 4          3.99E-4    3.99E-4     3.18E-6    0.0428
Case 5          0.0116     0.0120      0.0156     0.0584
Case 6          0.1306     0.1426      3.57E-5    0.0584
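The ccdf columns of Table 1 appear to be running totals of the prob columns, i.e. each case adds its own probability to the lower bound built from the preceding cases. That reading is an interpretation of the table rather than something stated explicitly, and the Python sketch below simply checks it against the listed values (to within the rounding of the printed figures).

```python
# Values copied from Table 1 (t = 200 RTTs, w = 10, W_max = 18), Cases 1 to 6.
prob = {
    "p=0.100": [0.0, 0.0, 0.0, 3.99e-4, 0.0116, 0.1306],
    "p=0.001": [0.0304, 0.0, 0.0123, 3.18e-6, 0.0156, 3.57e-5],
}
ccdf = {
    "p=0.100": [0.0, 0.0, 0.0, 3.99e-4, 0.0120, 0.1426],
    "p=0.001": [0.0304, 0.0304, 0.0428, 0.0428, 0.0584, 0.0584],
}


def running_sum(values):
    """Cumulative totals of a list of per-case probabilities."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out


for key in prob:
    pairs = zip(running_sum(prob[key]), ccdf[key])
    for case, (accumulated, listed) in enumerate(pairs, start=1):
        print(key, f"Case {case}", f"{accumulated:.4f}", f"{listed:.4f}")
```

Read this way, the table shows that at p = 0.100 almost the entire bound comes from Cases 5 and 6, whereas at p = 0.001 the no-loss case alone accounts for roughly half of it.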



4 Conclusions and Discussions 

In this paper we provided an explanation of how TCP can cause self-similarity 
in network traffic. Using traces of actual TCP transfers over the Internet, we 
showed that individual TCP flows, isolated from the aggregate flow on the link 
also have a self-similar nature. Our results also showed that the degree of self- 
similarity is dependent on the loss rates experienced by the flow and increases 
with increasing loss rates with the traffic no longer self-similar at very low loss 
rates. We then proposed a model explaining the contribution of TCP to traffic 
self-similarity. The model is based on considering each TCP flow as the super- 
position of a number of ON / OFF processes where the OFF times have a heavy 
tailed distribution. We verified the model empirically and then provided a firm
mathematical basis to the empirical observations of heavy-tailed distributions in
the OFF times by deriving a heavy-tailed lower bound on the ccdf.

The loss rate experienced by a TCP flow is an important indicator of the 
degree of self-similarity in the network traffic. A natural construction of the 
extremely bursty nature of TCP traffic comes from timeouts which represent 
“silent” periods and separate periods of activity. Since a majority of loss in- 
dications under current Internet scenarios lead to timeouts, losses increase the 
burstiness and the heavy tails in the OFF times. The degree of self-similarity or 
H being dominated by the heaviest tail in the superposition, higher loss rates
thus lead to higher values of H. In contrast, when the loss rate is extremely low,
TCP transmits W_max packets in every round and behaves like a CBR source.
Thus the bursty nature is absent at low loss rates and consequently the OFF
times have an exponential tail, with the traffic no longer being self-similar. This
explains the earlier observations where flows with loss rates less than 0.001
had a Hurst parameter of approximately 0.5. Our findings show that the loss
probability is a faithful indicator of the “network’s effect” on TCP traffic in
terms of both the effects of superposition with other flows and the degree of
self-similarity of the traffic.

While TCP Reno is the most widely implemented version of TCP, other ver- 
sions of TCP are currently under research, the most notable amongst them being 
TCP SACK. TCP SACK provides robustness against multiple packet losses in a 
single window and recovers them without resorting to timeouts. However, it does 
not completely eliminate timeouts since it requires the receipt of K (usually 3) 
duplicate ACKs before the retransmission mechanism kicks in. Thus timeouts 
are inevitable for small windows and will be present even for larger windows for 
correlated losses. Consequently we expect self-similarity to be present in TCP 
SACK traces also, though the loss rates at which H > 0.5 will be greater than 
those for TCP Reno.

Abstract. Many bandwidth estimation techniques, somehow related to
the TCP world, have been proposed in the literature and adopted to 
solve several problems. In this paper we discuss their impact on the con- 
gestion control of TCP and we propose an algorithm which performs 
an explicit and effective estimate of the used bandwidth. We show by 
simulation that it efficiently copes with the packet clustering and ACK 
compression effects without leading to the biased estimate problem of ex- 
isting algorithms. We present numerical results proving that TCP sources 
implementing the proposed scheme with an unbiased used-bandwidth es- 
timate fairly share the bottleneck bandwidth with classical TCP Reno 
sources. Finally, we point out the benefits of using the proposed scheme 
compared to TCP Reno in networks with wireless links. 



1 Introduction 

The Transmission Control Protocol (TCP) is based on the assumption that the 
network does not provide any explicit feedback to the sources. Therefore each 
source must form its own estimates of the network path properties, such as round- 
trip time (RTT) or usable bandwidth, in order to perform efficient end-to-end 
congestion control. 

The TCP congestion control has the twofold aim of preventing congestion
events and achieving a fair share of bandwidth among different connections.
Therefore, according to the guidelines in [?] and [?], it is worth defining the
available bandwidth as the maximum rate at which a TCP connection, exercising
correct congestion control, should ideally transmit, and the used bandwidth
as the rate at which the source is actually sending data.

The most widely deployed TCP implementations (TCP Reno and its extensions
such as SACK [?] or New Reno [?]) do not explicitly estimate the available
bandwidth. Instead, the end-systems maintain two state variables to regulate
the transmission rate: the congestion window (cwnd), which usually determines
the transmission window, and the slow start threshold (ssthresh), which marks
the cwnd value that discriminates between the slow-start and the congestion
avoidance phases. At the beginning of the connection, as the ssthresh is set to
a large value, the source exponentially increases the number of packets in flight
(slow start) until the network drops packets, thus signaling congestion. In
response to congestion, TCP Reno sets the ssthresh to one half of the bytes in
flight, and rapidly enters the congestion avoidance phase, during which the cwnd
is linearly increased.
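The interplay of cwnd and ssthresh described above can be sketched as follows. This is a deliberately simplified Python model and not a full Reno implementation: the segment size `MSS`, the initial threshold, the floor of two segments on ssthresh, and setting cwnd directly to ssthresh after a loss are all assumptions made for the illustration.

```python
from dataclasses import dataclass

MSS = 1460  # assumed segment size in bytes


@dataclass
class RenoState:
    cwnd: float = MSS            # congestion window, bytes
    ssthresh: float = 64 * 1024  # large initial threshold, bytes


def on_ack(s: RenoState) -> None:
    """Window growth on each new ACK."""
    if s.cwnd < s.ssthresh:
        s.cwnd += MSS                 # slow start: one segment per ACK
    else:
        s.cwnd += MSS * MSS / s.cwnd  # congestion avoidance: about one segment per RTT


def on_congestion(s: RenoState, bytes_in_flight: float) -> None:
    """Reaction to a loss indication: halve the threshold, restart from it."""
    s.ssthresh = max(bytes_in_flight / 2, 2 * MSS)
    s.cwnd = s.ssthresh
```

After a loss the threshold records half of what was in flight, which is why it can be read as an implicit estimate of the bandwidth the connection was using.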

In [?] it has been shown that the general scheme of additive increase and
multiplicative decrease (AIMD), on which the congestion control scheme of
TCP is based, leads to a fair share of the network bandwidth among different
connections in an ideal scenario where all TCP connections take decisions in a
synchronized fashion. So, ideally, the ssthresh gives an implicit estimate of the
available bandwidth and the congestion avoidance is used to gently probe for
extra bandwidth.

Unfortunately, it is well known that in real scenarios TCP Reno fails to
achieve fair allocation of the bandwidth among connections sharing the same
bottleneck when the connections experience different conditions on the
end-to-end path (for example, different path delays). For these reasons the
ssthresh can be considered as an implicit estimator only of the used rather than
the available bandwidth.

Moreover, in TCP Reno, the implicit bandwidth estimate is strictly depen- 
dent on the congestion control events experienced by the connection. Therefore, 
as TCP Reno actually does an implicit estimate of the bandwidth it is using, 
we may ask whether it is worth performing an explicit run-time estimate of the 
used bandwidth and how this estimated value can be used by the congestion 
control scheme. 

Various bandwidth estimation techniques, somehow related to the TCP world,
have been proposed in the literature and adopted to solve different problems
[?]. In this paper we first review these techniques, pointing out their
impact on the behavior of the TCP congestion control (Section 2). We then
propose an algorithm which performs an explicit and effective estimate of the used
bandwidth and show by simulation that it efficiently copes with the packet
clustering and ACK compression effects without leading to the biased estimate
problem of the algorithm proposed in [?] (Section 3). We show, however, that the
best way to use the estimated value is to set the ssthresh to the byte-equivalent of
the bandwidth/delay product only after congestion events, as proposed in [?].
Moreover, we present numerical results proving that TCP sources implementing
the proposed scheme with an unbiased used-bandwidth estimate fairly share
the bottleneck bandwidth with classical TCP Reno sources, provided that the
end-to-end path conditions are the same. Finally, we point out the benefits of
using the explicit used-bandwidth estimate compared to the implicit bandwidth
estimate of TCP Reno when wireless links are on the path.



2 Estimation Techniques 

In a classical IP architecture, to provide best effort service the network resources
must be shared by all flows as fairly as possible. A centralized controller
could in principle regulate the rate of all flows to ensure fairness based on the
knowledge of the number of flows and the routing paths. However, a central
controller is unfeasible and too far from the IP philosophy, so the network must
somehow estimate the bandwidth availability in a distributed way.

In the Core Stateless Fair Queueing (CSFQ) scheme [?], the bandwidth estimate is
performed at the IP router level. The router, knowing the bandwidth B_k of
its k-th outgoing link and the number n_k of active flows by means of packet
classification, estimates by B_k/n_k the bandwidth available to each flow. Through
a run-time estimate of the bandwidth actually used by each flow, the router can
decide to drop packets belonging to connections using bandwidth in excess, i.e.
connections sending at a rate greater than the available bandwidth B_k/n_k. This
approach has been shown to solve problems of unfairness among connections
having different round trip times, and can be the basis for mechanisms designed
to regulate non-TCP-friendly or unresponsive flows [?].
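The per-flow share B_k/n_k and the decision to penalize flows that exceed it can be illustrated with a few lines of Python. The drop rule used here (drop the fraction of a flow's traffic that exceeds its share) is an assumption made for the sketch; the text above only says that the router can decide to drop packets of flows using bandwidth in excess.

```python
def fair_share(link_bw_bps: float, active_flows: int) -> float:
    """B_k / n_k: bandwidth available to each flow on the outgoing link."""
    if active_flows < 1:
        raise ValueError("at least one active flow is required")
    return link_bw_bps / active_flows


def drop_probability(flow_rate_bps: float, share_bps: float) -> float:
    """Assumed rule: drop the fraction of traffic above the fair share."""
    if flow_rate_bps <= share_bps:
        return 0.0
    return 1.0 - share_bps / flow_rate_bps


share = fair_share(10e6, 20)           # 10 Mb/s link, 20 flows
print(share)                           # per-flow share in bit/s
print(drop_probability(1e6, share))    # flow sending at twice its share
print(drop_probability(400e3, share))  # flow below its share
```

With this rule a flow sending at twice its share loses half of its packets, which brings its accepted rate back to B_k/n_k.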

CSFQ has the great advantage of forcing flows to fairly share bandwidth 
even when the congestion control mechanism of the transport protocol is not 
accurate. However, it requires relevant modifications in the IP routers and it 
cannot be easily deployed over the Internet. 

If only the end-systems are in charge of the rate regulation, without any
explicit support from the network, some kind of bandwidth estimate must be
performed at the TCP level. Explicit bandwidth estimation algorithms have been
proposed for use by the TCP sources at the beginning of the connection. Their
main goal is to set the first value of the ssthresh in order to mitigate the effect
of multiple losses due to the high default value commonly used [?]. Though the
ssthresh should be set to the available bandwidth, most of the proposed schemes
estimate the bottleneck bandwidth, a quantity which can be more easily tracked
by analyzing the timing structure of received acknowledgments (ACKs). The
“Packet Pair” algorithm [?] is based on the assumption that if two packets are
sent with closely spaced timing, the interarrival time of the ACKs strictly reflects
the bottleneck bandwidth. However, as shown in [?], this technique performs
poorly if implemented at the sender side, mainly due to ACK compression [?],
which alters the ACK spacing. Some variants of “Packet Pair” consist in tracking
“Closely Spaced ACKs” (CSAs) [?].
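The Packet Pair idea reduces to dividing the packet size by the spacing of the two ACKs. The Python sketch below is a minimal illustration with made-up numbers (`packet_pair_estimate` is not a name from the literature), and it also shows how a compressed ACK spacing inflates the estimate.

```python
def packet_pair_estimate(packet_bits: int, ack_time_1: float, ack_time_2: float) -> float:
    """Bottleneck bandwidth in bit/s inferred from the spacing of two ACKs."""
    gap = ack_time_2 - ack_time_1
    if gap <= 0:
        raise ValueError("the second ACK must arrive after the first")
    return packet_bits / gap


# Two 1500-byte packets crossing a 10 Mb/s bottleneck leave it 1.2 ms apart.
print(packet_pair_estimate(12_000, 0.1000, 0.1012))

# If queueing on the return path compresses the ACK gap to 0.3 ms,
# the same computation overestimates the bottleneck by a factor of four.
print(packet_pair_estimate(12_000, 0.1000, 0.1003))
```

This sensitivity to the ACK spacing is the reason why sender-side implementations are reported to perform poorly.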

A more sophisticated bandwidth estimation scheme which runs throughout
the connection has been adopted in TCP Vegas [?]. While TCP Reno relies on
packet losses in order to estimate the available bandwidth of the network, TCP
Vegas estimates the available bandwidth of the network based on the difference
between the expected and the actual flow rate. The expected and actual rates
are given by cwnd/baseRTT and cwnd/RTT, respectively, where baseRTT is
the minimum RTT ever recorded by the TCP source and RTT its last value.
By this mechanism, when the network is not congested, the actual flow rate is
close to the expected one, while, when the network is congested, the actual rate is
smaller than the expected flow rate.
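The two rates used by TCP Vegas follow directly from the definitions above. The Python sketch below only computes them and their difference; how Vegas then adjusts cwnd from that difference is not covered here, and the function name is illustrative.

```python
def vegas_rates(cwnd: float, base_rtt: float, rtt: float):
    """Return (expected, actual, difference) flow rates, in cwnd units per second."""
    expected = cwnd / base_rtt  # rate if no queueing delay were present
    actual = cwnd / rtt         # rate implied by the latest RTT sample
    return expected, actual, expected - actual


# Uncongested path: the last RTT equals the minimum RTT, so the rates coincide.
print(vegas_rates(20, 0.100, 0.100))

# Congested path: queueing inflates the RTT and the actual rate falls behind.
print(vegas_rates(20, 0.100, 0.125))
```

A growing difference therefore signals that packets are accumulating in queues along the path before any loss occurs.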

TCP Vegas builds on this explicit and continuous bandwidth estimate a
new congestion control scheme which leads to convergence of the congestion
window to an equilibrium point. It has been shown, however, that even TCP
Vegas fails to obtain a fair allocation of bandwidth, especially in a heterogeneous
environment. Moreover, it is known that TCP Vegas is greatly penalized by the
aggressive nature of TCP Reno, and so it receives very little bandwidth while
Reno easily captures the rest [?]. Even the use of RED gateways, while
improving the situation, fails to fill the gap between Reno and Vegas. Finally,
in [?] it has been pointed out that even in a homogeneous environment, TCP
Vegas may fail to achieve fairness, fundamentally due to the convergence to fixed
but different values of the cwnd parameters of competing connections.

TCP Westwood, recently proposed in [?], performs an estimate of the
available bandwidth by measuring the returning rate of acknowledgments, and
uses this estimate to set the ssthresh and the cwnd after congestion events such
as the receipt of three duplicate ACKs or coarse timeout expirations. TCP
Westwood uses this faster recovery mechanism to avoid the blind halving of the
sending rate that TCP Reno performs after packet losses. Therefore, this explicit
bandwidth estimation scheme has a deep impact on the performance of TCP
Westwood sources, especially in the presence of random, sporadic losses typical
of wireless links or on paths with a high bandwidth/delay product.

The bandwidth estimation algorithm performed by TCP Westwood as reported
in [?] is described by the following pseudocode:

    if (ACK is received)
        sample_BWE[k] = (acked * pkt_size * 8) / (now - lastacktime);
        BWE[k] = beta * BWE[k-1] + (1 - beta) * (sample_BWE[k] + sample_BWE[k-1]) / 2;
    endif

Here, acked indicates the number of segments acked by the latest ACK,
pkt_size indicates the segment size in bytes, now indicates the current time,
lastacktime indicates the time the previous ACK was received, k and k-1 indicate
the current and previous value of the variables, BWE is the low-pass filtered
measure of the available bandwidth, and beta is the pole used for the filtering
(in [?] a value of beta = 19/21 is suggested).
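A runnable rendering of the pseudocode may help when experimenting with the filter. The Python class below is a sketch under stated assumptions rather than the TCP Westwood implementation: the class name, the default segment size, and the handling of the first ACK and of a zero time interval (both are skipped) are choices made here for illustration.

```python
class WestwoodEstimator:
    """Low-pass filtered bandwidth estimate driven by ACK arrivals."""

    def __init__(self, beta: float = 19 / 21, pkt_size: int = 1500):
        self.beta = beta          # pole of the low-pass filter
        self.pkt_size = pkt_size  # segment size in bytes (assumed default)
        self.bwe = 0.0            # filtered estimate, bit/s
        self.prev_sample = 0.0    # previous raw sample, bit/s
        self.last_ack_time = None

    def on_ack(self, now: float, acked: int) -> float:
        """Update the estimate for an ACK covering `acked` segments at time `now`."""
        if self.last_ack_time is None:
            self.last_ack_time = now  # first ACK: no interval to measure yet
            return self.bwe
        interval = now - self.last_ack_time
        if interval <= 0:
            return self.bwe           # guard against division by zero
        sample = acked * self.pkt_size * 8 / interval
        self.bwe = self.beta * self.bwe + (1 - self.beta) * (sample + self.prev_sample) / 2
        self.prev_sample = sample
        self.last_ack_time = now
        return self.bwe
```

Feeding the estimator evenly spaced ACKs makes the estimate converge to the true rate, whereas feeding it bursts of closely spaced ACKs produces very large samples that pull the estimate upwards.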

The basic idea of the proposed scheme is to low-pass filter the bandwidth
signal to obtain an accurate estimate of the bandwidth not affected by sporadic
losses. Unfortunately, directly filtering the samples of BWE presents some
drawbacks when the packet interarrivals are significantly different. The example
depicted in Figure 1 shows a simple and typical situation where this scheme fails
to correctly estimate the bandwidth.

Fig. 1. Packet timing structure

Here L is the packet length expressed in bits, T_p the time interval between
contiguous packets, and T the total observed time. The bandwidth used by the
connection is … and, for simplicity, T = 9 T_p. The algorithm above, however,
estimates approximately the value … as it averages the rates. By filtering we
extract the average value of the rate, which is different from the used bandwidth.
To be slightly more rigorous, let the random variables Y and X represent the
packet length and the interarrival time respectively. The average rate is given
by:

$$E\left[\frac{Y}{X}\right] = E[Y]\,E\left[\frac{1}{X}\right] \qquad (1)$$

X and Y being independent. This value is in general different from the used
bandwidth, which is given by μ_y/μ_x, where μ_x = E[X] and μ_y = E[Y]. If we
expand the function 1/X around the value μ_x, up to the third term, we obtain:

$$E\left[\frac{Y}{X}\right] \approx \frac{\mu_y}{\mu_x}\left(1 + \frac{\sigma_x^2}{\mu_x^2}\right) \qquad (2)$$

where σ_x^2 = E[(X - μ_x)^2]. Even if the validity of expression (2) is limited
due to the approximations, it shows that the estimate is biased and the error
depends on the variance of the interarrivals.
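The gap between the average of the rates and the used bandwidth is easy to reproduce numerically. The Python sketch below uses a hypothetical set of clustered interarrival times (three short gaps followed by a long one); the numbers and function names are illustrative only. It also evaluates the second-order expansion of 1/X around the mean interarrival time mentioned above.

```python
from statistics import mean, pvariance


def mean_of_rates(packet_bits: float, interarrivals) -> float:
    """Average of the per-packet rates L / x, i.e. what direct filtering tracks."""
    return mean(packet_bits / x for x in interarrivals)


def used_bandwidth(packet_bits: float, interarrivals) -> float:
    """Total bits divided by total time, i.e. mu_y / mu_x for fixed-size packets."""
    return packet_bits * len(interarrivals) / sum(interarrivals)


def second_order_estimate(packet_bits: float, interarrivals) -> float:
    """mu_y / mu_x * (1 + var_x / mu_x**2), from expanding 1/X around mu_x."""
    mu_x = mean(interarrivals)
    return packet_bits / mu_x * (1 + pvariance(interarrivals) / mu_x ** 2)


L_BITS = 12_000
gaps = [0.001, 0.001, 0.001, 0.007]  # seconds; hypothetical clustered arrivals

print(used_bandwidth(L_BITS, gaps))
print(mean_of_rates(L_BITS, gaps))
print(second_order_estimate(L_BITS, gaps))
```

With these gaps the average of the rates is roughly twice the used bandwidth, and the second-order expression gives the right order of magnitude for that bias.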

Figure 2 shows the bandwidth estimated by one of 20 TCP Westwood connections
performing the rate estimation algorithm and sharing the same 10 Mb/s
bottleneck. Similar curves are observed for the other connections. The bottleneck
queue was designed to hold a number of packets equal to the bandwidth/delay
product. The one-way-RTT was 50 ms, the test lasted 600 simulated seconds
to simulate an FTP session, and beta was set to 0.995. These results, like all the
others presented in this paper, were obtained using the Network Simulator, ’ns’
ver. 2 [?]. To provide a comparison, the dotted line represents the bandwidth
estimated by a TCP Reno source running the DFT algorithm we propose and
describe in detail in the next section. Since the fair-share value is 500 kb/s,
while the TCP Westwood algorithm estimates more than 8 Mb/s, we conclude that
the variance of packet interarrivals is quite large.

The interarrival times would be almost regular if packets belonging to different
connections alternated on the channel. On the contrary, it has been shown
that TCP transmissions tend to be clustered, so that on a channel we usually
observe many consecutive packets of the same connection. Note that the
bias of the bandwidth estimate does not depend on the value of beta chosen for
the pole of the IIR filter: any fixed value of the pole leads to the same problem.

Fig. 2. Bandwidth estimated by TCP Westwood

Directly filtering the rate measured from the ACK arrival times also
exposes the algorithm to the phenomenon of ACK compression. This happens
when the time spacing between returning ACKs is altered by congestion
of the routers on the return path. As one or more ACKs spend some time in the
queue of a congested router on the reverse path, subsequent ACKs may catch up
with them, and their original spacing is lost. It has also been shown that ACK
compression is quite relevant in the operation of real networks, and therefore it
cannot be neglected. To evaluate the impact of ACK compression on the
algorithm, we considered a scenario with two connections sharing the same
10 Mb/s bottleneck link but transmitting data in opposite directions.
The end-to-end propagation delay was 100 ms, and the bottleneck queue
could hold a number of packets equal to the bandwidth/delay product. The
two routers at the ends of the bottleneck are therefore loaded with both packets
from one connection and ACKs of the other, which leads to the ACK compression
described above. The results show that the impact on the
estimate of TCP Westwood is dramatic, since the estimate is about 25 times
higher than that shown in Figure 2.

Finally, we point out that an algorithm implemented in the TCP source
according to the TCP Westwood approach can estimate the used bandwidth, not
the available bandwidth. Therefore, the benefits of the new scheme are not
due to an estimate of the available bandwidth, which cannot be estimated
end-to-end, but to the possibility of explicitly estimating the used bandwidth,
taking into account the short-to-medium-term history of packet arrivals.
Moreover, the bandwidth that the algorithm tries to estimate explicitly is the
same one implicitly considered by TCP Reno and reflected by the ssthresh in
steady-state conditions. So, if the estimate is accurate enough, we expect the
bandwidth used by TCP Reno and by a TCP exploiting an explicit bandwidth
estimate to be almost the same. This can provide fair behavior in both
homogeneous scenarios (all sources using the algorithm) and heterogeneous
scenarios (where classical TCP Reno sources are also present).

3 Double Filtering Technique

In this section we present a new technique, the Double Filtering Technique
(DFT), which builds on the basic idea of TCP Westwood and succeeds in obtaining
correct estimates of the bandwidth used by the TCP source.




Fig. 3. Packet timing structure



To explain the rationale of DFT, let us refer to the example in Figure 3, where
transmissions occurring in a period T are considered. Let n be the number
of packets belonging to a connection and $L_1, L_2, \ldots, L_n$ the lengths, in
bits, of these packets. The average bandwidth used by the connection is simply
given by $\frac{1}{T}\sum_{i=1}^{n} L_i$. If we define
$L = \frac{1}{n}\sum_{i=1}^{n} L_i$, we can express the bandwidth (Bw)
occupied by the connection as:

$$Bw = \frac{nL}{T} = \frac{L}{T/n} \qquad (3)$$



The basic idea is to perform a run-time sender-side estimate of the average 
packet length, L, and the average interarrival, separately. Following the TCP 
Westwood approach this can be done by measuring and low-pass filtering the 
length of acked packets and the intervals between ACKs’ arrivals. However, since 
we want to estimate the used bandwidth we can also low-pass filter directly the 
packets’ length and the intervals between sending times. 

Note that sending time intervals can be very short when groups of
packets are generated by TCP sources. However, this is not a problem for DFT,
since the filtering is performed directly on the interarrival samples. The
situation is different for algorithms that filter the bandwidth samples, such as
TCP Westwood, since these samples tend to infinity as the intervals shrink.

The pseudocode of the two bandwidth estimation schemes is the following:

1) Processing the stream of sent packets:

    if (Packet is sent)
        sample_length[k] = (packet_size * 8);
        sample_interval[k] = now - last_sending_time;
        Average_packet_length[k] = alpha * Average_packet_length[k-1] +
                                   (1-alpha) * sample_length[k];
        Average_interval[k] = alpha * Average_interval[k-1] +
                              (1-alpha) * sample_interval[k];
        Bwe[k] = Average_packet_length[k] / Average_interval[k];
    endif

where packet_size indicates the segment size in bytes, now indicates the current
time, last_sending_time the time the previous packet was sent, and k and k-1
indicate the current and previous values of the variables. Average_packet_length
and Average_interval are the low-pass filtered measures of the packet length and
of the interval between sending times. Alpha is the pole of the two low-pass
filters. Bwe is the measure of the available bandwidth.

2) Processing the stream of received ACKs:

    if (ACK is received)
        sample_length[k] = (acked * packet_size * 8);
        sample_interval[k] = now - last_ack_time;
        Average_packet_length[k] = alpha * Average_packet_length[k-1] +
                                   (1-alpha) * sample_length[k];
        Average_interval[k] = alpha * Average_interval[k-1] +
                              (1-alpha) * sample_interval[k];
        Bwe[k] = Average_packet_length[k] / Average_interval[k];
    endif



where the quantities are the same as before. Here, acked indicates the num- 
ber of segments acked by the latest ACK. In order to compute this value, the 
algorithm shown in m] must be used. 
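
As an illustration, both schemes can be sketched with a single Python class. This is a hedged sketch and not the ns-2 implementation used for the simulations: the class name `DoubleFilterEstimator`, the choice to seed each filter with its first sample, and the explicit update of the last event time are assumptions introduced here.

```python
class DoubleFilterEstimator:
    """Double filtering sketch: low-pass filter packet lengths and intervals
    separately, then take their ratio as the bandwidth estimate (bit/s)."""

    def __init__(self, alpha=0.99):
        self.alpha = alpha          # pole of both low-pass filters
        self.avg_length = None      # filtered sample length, in bits
        self.avg_interval = None    # filtered interval, in seconds
        self.last_time = None       # time of the previous packet or ACK

    def update(self, now, packet_size, acked=1):
        """Process one sent packet (acked=1) or one ACK covering `acked` segments.

        `packet_size` is in bytes and `now` in seconds. Returns the current
        estimate in bit/s, or None until one interval has been observed.
        """
        sample_length = acked * packet_size * 8
        if self.last_time is None:
            # Assumption: the first event only initialises the state.
            self.last_time = now
            self.avg_length = float(sample_length)
            return None
        sample_interval = now - self.last_time
        self.last_time = now
        a = self.alpha
        self.avg_length = a * self.avg_length + (1 - a) * sample_length
        if self.avg_interval is None:
            # Assumption: seed the interval filter with the first sample.
            self.avg_interval = sample_interval
        else:
            self.avg_interval = a * self.avg_interval + (1 - a) * sample_interval
        if self.avg_interval <= 0:
            return None
        return self.avg_length / self.avg_interval


# Usage: 1000-byte packets sent in pairs, 0.1 ms apart within a pair and
# 1.9 ms between pairs, i.e. one packet per millisecond on average.
estimator = DoubleFilterEstimator(alpha=0.99)
t = 0.0
bwe = estimator.update(t, 1000)
for k in range(4000):
    t += 0.0001 if k % 2 == 0 else 0.0019
    bwe = estimator.update(t, 1000)
print(f"estimate: {bwe:.0f} b/s")
```

Even though the packets are clustered, the estimate settles close to the used bandwidth of 8 Mb/s, because the short and long intervals are averaged before the ratio is taken.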

If we consider the minimum RTT measured by the TCP source (RTTmin) as
a good estimator of the end-to-end propagation delay, then we can set:

Ssthresh = Bwe * RTTmin (4)

The ssthresh is set to the value of equation (4) only after three duplicate
ACKs, or after a coarse-grained timeout expiration, following the guidelines
of TCP Westwood.
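
As a worked example of the ssthresh setting, the sketch below converts the product Bwe * RTTmin into a number of segments. The units (Bwe in bit/s, RTTmin in seconds), the 1460-byte segment size and the lower bound of two segments are assumptions made for illustration; the helper name `ssthresh_segments` is hypothetical.

```python
def ssthresh_segments(bwe_bps, rtt_min_s, segment_size_bytes=1460, floor=2):
    """Return Bwe * RTTmin expressed in whole segments.

    bwe_bps:            bandwidth estimate in bit/s
    rtt_min_s:          minimum measured RTT in seconds
    segment_size_bytes: assumed segment size
    floor:              assumed lower bound, in segments
    """
    pipe_bytes = bwe_bps * rtt_min_s / 8.0
    return max(floor, int(pipe_bytes // segment_size_bytes))


# A 500 kb/s estimate with a 100 ms minimum RTT gives 6250 bytes in flight.
print(ssthresh_segments(500e3, 0.1))
```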

Simulation results show that DFT is not biased and obtains bandwidth
estimates that oscillate around the fair-share value when all TCP sources
experience almost the same path conditions. In order to smooth these oscillations
and ensure an estimate closer to the right value, we propose to further filter the
value of Bwe as follows:

$$Bwe[k] = \left(1 - e^{-T[k]/T_0}\right) \frac{\mathrm{Average\_packet\_length}[k]}{\mathrm{Average\_interval}[k]} + e^{-T[k]/T_0} \, Bwe[k-1] \qquad (5)$$

where T[k] is the instantaneous time interval between two estimates and $T_0$ is a
time constant, which we set equal to 1 second in our simulations. By binding the
value of the pole to T[k], we perform an adaptive filtering which exploits the
oscillations of the signal Bwe in order to quickly follow variations in the
available bandwidth.
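
The adaptive step can be sketched as follows, assuming that the pole bound to T[k] is exp(-T[k]/T0); the function name `adaptive_filter` and its argument names are introduced here for illustration only.

```python
import math


def adaptive_filter(prev_bwe, ratio_sample, dt, t0=1.0):
    """One step of a low-pass filter whose pole depends on the elapsed time.

    prev_bwe:     previous filtered estimate, Bwe[k-1]
    ratio_sample: Average_packet_length[k] / Average_interval[k]
    dt:           time elapsed since the previous estimate, T[k], in seconds
    t0:           time constant in seconds (1 s in the simulations)
    """
    pole = math.exp(-dt / t0)   # assumed form of the time-dependent pole
    return (1.0 - pole) * ratio_sample + pole * prev_bwe


# Closely spaced samples barely move the estimate, while a sample arriving
# after a long gap replaces most of the old value.
print(adaptive_filter(5e6, 8e6, dt=0.001))
print(adaptive_filter(5e6, 8e6, dt=3.0))
```

With this choice the memory of the filter is measured in seconds rather than in number of samples, so bursts of closely spaced packets do not dominate the estimate.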

Figures 4a and 4b show the behavior of DFT without and with the filtering
performed by Equation (5), respectively. For both figures the scenario
consists of a single TCP connection running over a 10 Mb/s link. In the interval
between 200 and 300 seconds, a UDP flow, having the same priority as TCP,
transmits at a rate of 4 Mb/s. Then, in the interval between 300 and 400 seconds,
another UDP flow starts transmitting at 2 Mb/s. The bottleneck queue,
managed with a drop-tail policy, was sized to hold a number of packets equal
to the bandwidth/delay product, and the simulation lasted 600 simulated
seconds. In Figure 4a the oscillations are evident, even though the estimate is
unbiased. Moreover, the algorithm adapts slowly to changes in the bandwidth
available to the connection. In Figure 4b, instead, the oscillations have been
smoothed, and the estimate follows bandwidth variations in the underlying
network more quickly.

Fig. 4. DFT bandwidth estimate (a) without adaptive filtering (b) with adaptive
filtering

Following the TCP Westwood approach, we set the ssthresh to the estimated value
only after congestion events. This choice is supported by the worse performance
observed with more frequent updating of the ssthresh value, as shown in Figure 5.
The scenario and the parameters considered are the same as in Figure 4, but the
ssthresh is continuously updated. We observe that the estimate is not accurate,
mainly because the continuous updating of the ssthresh to the estimated used
bandwidth forces the source into the congestion avoidance phase and prevents
it from following variations in the available bandwidth. Similar results have
been obtained with periodic updating with a period of 0.5 s.

Figure 6 compares the performance of the DFT algorithm (the version filtering
the stream of sent packets) and TCP Reno in a simulation scenario
with 10 connections sharing a single bottleneck link of 10 Mb/s with
an end-to-end delay of 100 ms. The buffer holds a number of packets equal
to the bandwidth/delay product, and FIFO queue management is adopted
in order to test DFT even in the absence of any fair queueing. Several simulations
were run and the results averaged in order to eliminate phase
effects. We numbered the connections from 1 to 10: the first 5 used the DFT
algorithm, the other 5 TCP Reno. We observe that an almost fair division of the
link is obtained and that both algorithms use almost the same bandwidth.

Fig. 5. Bandwidth estimate with continuous ssthresh updates

Fig. 6. DFT fairness towards TCP Reno in a 10 Mb/s bottleneck (alpha = 0.99);
horizontal axis: connection number



We also ran simulations over different scenarios, covering link bandwidths
ranging from a few kb/s to 150 Mb/s, varying the number of competing
connections, and also using a more complex topology with multiple congested
gateways. The conclusions are the same: the level of fairness obtained by DFT is
no worse than that of TCP Reno. Simulation results also show that the strength of
DFT lies in its scalability: the more connections share the bottleneck link, the
lower the variance of the estimate. The presence of constant-rate flows, such as
UDP flows for IP telephony or video conferencing, makes DFT perform better, as it
reduces the size of the packet clusters.

So far we have demonstrated the accuracy of the DFT algorithm and shown that
TCP sources using this algorithm are fair to other sources. To complete the
performance evaluation, we need to verify its ability to achieve high throughput
in the presence of links affected by sporadic losses, as TCP Westwood does.




Fig. 7. DFT and RENO throughput vs link error rate 



To this purpose, in Figure 7 we compare the throughput achieved by a
connection running the DFT algorithm with that of a TCP Reno connection over
a link with random errors. The link has a capacity of 10 Mb/s, and the FIFO
queue can hold a number of packets equal to the bandwidth/delay product.
The one-way RTT is 50 ms, and the link drops packets according to a Poisson
process with an average drop rate ranging from 0.01% to 1%. We observe that DFT
sustains a higher throughput than TCP Reno at all drop rates considered. This
is due to the filtering process, which also takes into account the past history of
the bandwidth estimates and thus avoids confusing congestion signals due to
queue drops with losses due to link errors.

4 Conclusions 

In this paper we proposed the DFT algorithm, which performs an explicit and
effective run-time estimate of the bandwidth used by a TCP source. It is based
on separate filtering of the intervals between sending times of TCP packets
and of the packet lengths.

Following the approach of TCP Westwood, we used the estimate to set the
ssthresh after congestion events. We showed by simulation that the accuracy of
the estimation algorithm allows the proposed scheme to be fair to TCP Reno
connections sharing the same bottleneck channel. As a result, it can be
adopted gracefully in the IP world with no coexistence problems. In addition,
the new scheme, unlike TCP Reno, copes effectively with random channel
errors such as those occurring in wireless environments.

Abstract. This paper deals with a Differentiated Services test-bed,
developed at Alcatel Labs in order to evaluate and compare the performance
offered by three different queuing policies, namely FIFO, Priority
Queuing and Custom Queuing, when supporting Voice over IP traffic.
A Premium class has been assigned to voice traffic, while Best Effort
class traffic has been treated as disturbing traffic and
generated by means of suitable TCP and UDP software tools. Performance
comparisons have been carried out for the considered queuing
schemes. In particular, Priority Queuing has turned out to be the best
solution for Premium class quality in the developed test-bed. For this scheme
a suitable analytical approach has also been proposed and validated by
comparing analytical predictions with experimental results.



1 Introduction 

The present Internet is widely known as a best-effort network, meaning that it
is generally not possible to guarantee a pre-defined quality of service over IP
protocols. However, a general trend in communication networks is the attempt
to provide packet services with the typical guarantees of a switched network. In
this direction, IP Telephony is one of the most promising real-time technologies
over IP networks. In order to support voice services on the Internet, many
Quality of Service techniques are being extensively studied [3], and all of them
try to differentiate the treatment that flows receive from the network
depending on their priority level. This paper deals with the Differentiated
Services technique, where flows are treated with different policies depending on
the Differentiated Services Code Point (DSCP) written in their IPv4 Type of
Service (TOS) field. We have implemented a Differentiated Services test-bed at
Alcatel Labs, Florence, Italy, in order to evaluate the performance of the
resource allocation techniques under consideration. We have carried out several
tests in which a telephone call was established between two users, with the
traffic profile of the so-called Premium class. The quality provided to the
Premium class, in terms of perceived intelligibility, has been verified through
opinion score tests. Best Effort traffic has been generated in order to test the
Quality of Service degradation under an increasing network congestion level.
Several queuing techniques, such as FIFO, Priority Queuing (PQ), and Custom
Queuing (CQ), have been implemented in the network nodes, and their effects on
the overall system performance have been evaluated.

This paper is organized as follows. A brief review of the available
QoS management strategies is presented in Sec. 2, where the main
characteristics of the chosen queuing strategies are discussed.
In Sec. 3 the implemented test-bed is presented in detail, together with the
experimental tests. In Sec. 4 we discuss an analytical
model of the router queue when using PQ, which, as the preliminary evaluation
given in the earlier sections outlines, seems to be the best solution in the
scenario of this work, in particular for a low amount of voice traffic. We have
used the typical analysis methods for M/G/1 queues with priority
differentiation, and an average packet delay has been evaluated through this
method for a generic IP packet crossing the network while PQ is used on the
router queues. In Sec. 5 a more detailed performance comparison of the
experimental tests with FIFO, CQ and PQ is made, and some general observations
are given concerning their behavior in the test-bed. The analytical predictions
are then verified by comparison with experimental results. Finally, our
conclusions are discussed in the last section.

2 QoS Management Schemes for IP Networks 

In order to support real-time services over IP packet networks, and to let
delay-sensitive and best-effort traffic coexist in the same network, some main
performance parameters must be kept under control through suitable management
strategies. According to typical Quality of Service studies, three parameters seem
to be critical for real-time traffic support over packet networks, namely the
packet loss, the one-way delay [7] and the instantaneous voice packet
delay variation (IPDV) [8]. Typical values for these parameters depend on the
considered application and on the requested service level agreement. However,
for medium-size IP networks, an acceptable QoS for Voice over IP services can
be achieved with approximately 10%, 150 ms and 40 ms as maximum values
for the three QoS parameters above, respectively.

Towards this goal, many bandwidth reservation strategies have been studied in
the last decade. The two main approaches in the IP QoS field are the Integrated
Services (IntServ) and the Differentiated Services (DiffServ) architectures. The
IntServ model is commonly referred to as a per-flow architecture, where in each
network node a particular amount of bandwidth is reserved for each traffic flow
between two end-points. In order to exchange bandwidth reservation information
between routers, a signaling protocol is required, namely the Resource
Reservation Protocol (RSVP). As the number of nodes and users grows,
it becomes difficult to implement this model in the limited memory of
network routers, where for each flow a specific bandwidth reservation policy has
to be recorded and periodically refreshed, so that the amount of per-flow
information increases proportionally with the number of flows to be managed
by the IntServ network. The IntServ model is therefore said to be generally
non-scalable to large networks, and a possible solution has been proposed in the
DiffServ architecture. In DiffServ a per-behavior treatment is adopted in each
router, meaning that a particular behavior is associated to each traffic type, and, 
more specifically, to each QoS traffic class, by marking its IPv4 TOS field with 
a proper DSCP. Each router will then read the TOS field of each packet, it will 
classify it and then it will treat the packet according to the classified behavior, 
reserving a corresponding amount of bandwidth for that packet. This can, for
example, be achieved by providing several queues in the router instead of a single
FIFO queue. If we associate a different priority with each queue, we obtain an
overall differentiated treatment for each packet. Thus, it is sufficient to
record in each router a simple piece of information on the behavior associated
with each DSCP, and even if the network size grows, the same amount of memory
is required in the network routers. This approach seems to be much more
scalable than the IntServ model, even if a less dynamic bandwidth allocation 
is achievable with DiffServ than with IntServ. Many other further studies have 
been developed in the last five years, intended to solve the problems of each 
approach, but since the main concept of per-behavior approach seems to be the 
most promising among the proposed solutions, we have chosen to implement a 
DiffServ-like network, and to reserve bandwidth resources according to the TOS 
field of the transmitted IP packets. In each network node, after being
classified, either via a DiffServ or an IntServ method, the IP packet generally
meets a node-internal queue, where it must receive a proper treatment according
to the classification results. Towards this goal, we have considered two queuing
approaches that are currently implemented in many commercially available
routers, namely the Priority Queuing (PQ) and the Custom Queuing (CQ)
methods. In these queuing strategies, an input flow is classified via, for
example, the TOS field, and then a certain number of virtual queues is created in
the router, each one with its own priority, and the output scheduler reads
packets from one queue or another at a rate depending on their priority level.
For PQ, two queues are created, one with higher and the other one with lower 
priority, and delay-sensitive traffic is obviously addressed to the higher priority 
queue, thus receiving a much lower delay in the router crossing. In CQ, many 
queues can be created, each one with a specific priority level, in order to let many 
services have their fair bandwidth share, thus avoiding congestion situations. A 
router internal queue is assigned to each packet class, and the output scheduler 
reads packets in a round-robin mode. To each packet class, and therefore to each 
queue, a certain fixed bandwidth amount is reserved, and the output scheduler, 
by reading the corresponding bytes number from each queue, will reproduce in 
the output port the chosen bandwidth reservation scheme. It is evident that
PQ performs better than CQ when high QoS performance is required for
the Premium class Voice over IP traffic flow. In fact, as will be verified in
Sec. 5, with PQ it is possible to achieve a much lower delay variation spread
when the network congestion level is high, since in CQ a certain amount of
bandwidth is delivered to Best Effort traffic in any case. PQ can therefore be a
good choice when high priority Premium class traffic must be served with
excellent QoS parameters, while CQ seems to be more conservative and less
flexible when Best Effort traffic must also be kept within acceptable QoS limits.
Since the aim of this work was to evaluate suitable QoS algorithms to provide a
modest amount of VoIP traffic with acceptable quality of service, PQ will be
shown to be the best choice for our purposes.
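
The difference between the two disciplines can be illustrated with a small sketch. It is a simplified model under stated assumptions, not the behavior of any specific router software: packets are represented only by their sizes in bytes, all packets are already queued, and in the CQ model a queue is served until its byte count has been reached or exceeded. The function names are hypothetical.

```python
from collections import deque


def priority_schedule(premium, best_effort):
    """Strict priority: Best Effort is served only when no Premium packet waits."""
    high, low = deque(premium), deque(best_effort)
    while high or low:
        if high:
            yield ("premium", high.popleft())
        else:
            yield ("best_effort", low.popleft())


def custom_schedule(queues, byte_counts):
    """Round robin over the queues; in each round, queue i sends packets until
    at least byte_counts[i] bytes have been sent or the queue is empty."""
    queues = [deque(q) for q in queues]
    while any(queues):
        for i, q in enumerate(queues):
            sent = 0
            while q and sent < byte_counts[i]:
                size = q.popleft()
                sent += size
                yield (i, size)


# Four 200-byte voice packets competing with two 1500-byte data packets.
voice, data = [200] * 4, [1500] * 2
print(list(priority_schedule(voice, data)))
print(list(custom_schedule([voice, data], byte_counts=[200, 1500])))
```

In the PQ order all voice packets leave before any data packet, whereas in the CQ order each voice packet after the first waits behind a full-size data packet, which is consistent with the larger delay variation expected for CQ.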

In the sequel, the test-bed used for our experiments will be described in 
detail, and measurement conditions will be highlighted. 

3 Test-Bed Description 

In order to study and evaluate the performance of VoIP traffic over a
DiffServ-like network, we have developed a test-bed in the Alcatel Florence
laboratories, using standard devices in order to emulate the traffic behavior of
a common backbone IP network. As shown in Fig. 1, a backbone network has
been emulated through three Cisco routers, namely two 2600 routers with VoIP
ports acting as access routers, and one 4500 acting as the backbone router.
Ethernet interfaces have been used to connect the routers. Two groups of PCs and
workstations have been linked through switches and hubs to the access routers,
thus emulating the connection between two larger LANs via an IP network. Two
analog phones have been plugged into the two Cisco 2600 routers,
so that users were able to communicate via a Voice over IP service. The
communication between the two analog phones has been chosen as the Premium
class service, to which the priority algorithm delivers the required
amount of bandwidth. According to the DiffServ architecture, a specific DSCP is
written in the TOS field of the IPv4 packets generated by the two VoIP ports,
thus allowing the three routers to reserve a particular treatment for packets
carrying this DSCP. The PCs and workstations have been used to generate
disturbing traffic, namely TCP traffic via the TTCP software and UDP traffic via
the MGEN software. Disturbing traffic, of either TCP or UDP type, is considered
as Best Effort class traffic, and its packets are classified accordingly by the
three routers through the corresponding DSCP codes.

Fig. 1. Test-bed in Alcatel Labs

The general sequence of steps in our experiments is as follows. First, while the
telephone call was established, we gradually increased the congestion level of
the network and measured the corresponding degradation of the QoS parameters
in the end-to-end connection, using the default FIFO configuration in the
queues of the three routers. Then, we implemented priority management algorithms
in the router queues and measured the overall system performance in
terms of the chosen QoS parameters, while performing voice intelligibility
tests. For CQ, we chose two modes of operation, using two different
byte count values for the two CQ queues. In particular, in CQ-I we
reserved 10% of the bandwidth for the Premium class telephone traffic, which
for our purposes was more than enough for a single voice call, and the
remaining bandwidth for Best Effort traffic. In CQ-II we reserved 1% of the
bandwidth for Premium traffic, leaving the remaining bandwidth to Best Effort
traffic. Since our goal is to provide a low amount of VoIP traffic with an
acceptable QoS, while letting Best Effort class traffic be delivered without
performance degradation, the more flexible features offered by PQ have proven to
be the best solution in comparison with the fixed bandwidth allocation offered
by CQ. For this reason, in the next section we study an
analytical model of a PQ router, which will be used as a theoretical comparison
for the experimental tests.



4 Analytical Model 

A functional model of the internal queuing system of a router is shown in Fig. 2,
where its I/O interfaces are modeled as input and output queues, with an
internal queue representing the CPU processing time. In order to obtain a
realistic model, we have considered both traffic directions that the router faces.
We consider the voice Premium traffic coming from the telephone (input line
1), the Best Effort UDP and TCP traffic coming from the LAN (input line 3),
and the corresponding voice and Best Effort traffic coming from the other trunk
of the network, which in this case represents the backbone network. The output
line is therefore the output port of the router facing the backbone. According to
Fig. 2, a classifier is modeled at each input line, whose aim is to read the TOS
field of IP packets and to route them to the correct queue. Two internal queues
are considered, one for the high priority Premium traffic and the other for Best
Effort traffic (the low priority queue).

Fig. 2. Router model for analytical performance evaluation

We can model the whole system in Fig. 2 as a single M/G/1
queue with differentiated service times, with two priority classes managed in a
non-preemptive way. Since our goal is to calculate the average delay experienced
by a generic IP packet in the router queue, we can define the average total time
spent in the router by a generic IP packet as:



$$T_{router} = T_{classifier} + T_{queue} + T_{scheduler} + T_{CPU} + T_{codec+pack} \qquad (1)$$



where $T_{classifier}$ and $T_{scheduler}$ can be neglected, since they are several
orders of magnitude lower than the considered delay values; $T_{codec+pack}$, due
to the codec and packetization processes, can be assumed to be about 6 ms for
G.711 codecs; $T_{CPU}$ is the service time required for a packet to be processed
in the output queue, and therefore corresponds to the CPU processing time for
each packet; $T_{queue}$ is the average delay spent by each packet in the priority
queuing system, and will be denoted by W in the following. The average service
time for Premium and Best Effort packets, respectively, is given by:

$$X_i = \frac{L_i}{L_{rate}}, \quad i = 1, 2 \qquad (2)$$

where $L_i$ is the average length in bits of a Premium or Best Effort packet, and
$L_{rate}$ is the average bit rate of the input line expressed in bit/s. In the
following, we denote with the suffix "1" the Premium class packets, and with the
suffix "2" the packets belonging to Best Effort traffic. In order to develop the
analytical theory for our system, we must first investigate under which
assumptions the considered router is a stable queuing system. To this end, we
first evaluate the total arrival rate of Premium and Best Effort traffic, given
in packets/s:

$$\lambda_{tot} = \lambda_1 + \lambda_2 \qquad (3)$$



We can then calculate the arrival probabilities for Premium and Best Effort
packets, given by:

$$P_1 = \frac{\lambda_1}{\lambda_{tot}} \qquad (4)$$

and

$$P_2 = \frac{\lambda_2}{\lambda_{tot}} \qquad (5)$$



The average service time over both Premium and Best Effort packets is then equal
to:

$$X = P_1 X_1 + P_2 X_2 \qquad (6)$$

The average total utilization factor of the considered queuing system is
therefore:

$$\rho = X \lambda_{tot} \qquad (7)$$



In order to guarantee system stability, we must have $\rho < 1$, which happens,
for the considered architecture, when the network congestion level remains below
75%. Having established the limit for system stability, let us proceed to
the mathematical evaluation of the average delay spent by IP packets in
the router queues. If we call $W_1$ the queuing time experienced by a Premium
class packet (high priority), and $W_2$ the queuing time of a Best Effort packet
(low priority), the average waiting time spent in the high priority
queue is:

$$W_1 = R_m + N_1 X_1 \qquad (8)$$

where $R_m$ is the average residual time required to finish the processing of the
packet that is already being served (the system is a non-preemptive
queuing system), $X_1$ is the average service time of Premium class packets, given
in (2), and $N_1 = \lambda_1 W_1$ is the average number of Premium class requests
in the queue, with $\lambda_1$ being the average arrival rate of Premium class
packets. Equation (8) can then be rewritten as:

$$W_1 = \frac{R_m}{1 - \lambda_1 X_1} = \frac{R_m}{1 - \rho_1} \qquad (9)$$



where $\rho_1 = \lambda_1 X_1$ is the router utilization factor due to Premium
class packets. We can then proceed to calculate $W_2$ as:

$$W_2 = R_m + \rho_1 W_1 + \rho_2 W_2 + \rho_1 W_2 \qquad (10)$$

where $\rho_1 W_1$ is the average time spent to serve the Premium class requests
stored in the queue, $\rho_2 W_2$ is the average time required to serve the Best
Effort class requests, and $\rho_1 W_2$ is the average time needed to serve the
Premium class requests which arrive while the Best Effort packet is waiting in
the queue. Following the same method used for (9), we have:

$$W_2 = \frac{R_m}{(1 - \rho_1)(1 - \rho_1 - \rho_2)} \qquad (11)$$



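In order to calculate $W_1$ and $W_2$ from (9) and (11), we must make $R_m$
explicit. Following a well-known approach we obtain:

$$R_m = \frac{1}{2} \sum_{i=1}^{2} \lambda_i \overline{X_i^2} \qquad (12)$$

where $\overline{X_i^2}$ is the second moment of the service time of class i. This
allows us to derive a final numerical value for $W_1$ and $W_2$. Recalling eq. (1),
we obtain the total average delay in the router for both Premium class packets
and Best Effort class packets. We then have:

$$T_{router,Premium} = W_1 + X_1 + T_{codec+pack} = \frac{\sum_{i=1}^{2} \lambda_i \overline{X_i^2}}{2 (1 - \rho_1)} + X_1 + T_{codec+pack} \qquad (13)$$

and

$$T_{router,BestEffort} = W_2 + X_2 + T_{codec+pack} = \frac{\sum_{i=1}^{2} \lambda_i \overline{X_i^2}}{2 (1 - \rho_1)(1 - \rho_1 - \rho_2)} + X_2 + T_{codec+pack} \qquad (14)$$
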

which are all known quantities. We have thus evaluated the average delay
experienced by Premium class and Best Effort class IP packets in each router. To
obtain the end-to-end delay, we need to sum the delays encountered by voice
packets across the implemented network (shown in an earlier figure). As a first
approximation, since there are three routers, each using similar queuing
techniques, we can assume that the total end-to-end delay experienced by an IP
packet is three times the value given in (13) or (14), for packets belonging to
the Premium and Best Effort traffic classes respectively. This is only a rough
first approximation, but we will verify its practical applicability, and that of
the single-router delay evaluation described above, in the next section, where
the experimental measurements are compared with the analytical predictions.
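The delay model above can be turned into a few lines of code. The following is a minimal sketch, assuming a two-class non-preemptive priority queue in which the per-router delay is the waiting time plus the service time plus the codec and packetization time. The function name, its parameters, the combined stability check and the numerical values in the usage note are illustrative assumptions, not quantities taken from the test-bed.

```python
def priority_queue_delays(lam, x_mean, x_sq_mean, t_codec_pack=0.0, n_routers=3):
    """Average delays in a two-class non-preemptive priority queue.

    lam, x_mean and x_sq_mean are (Premium, Best Effort) pairs holding the
    arrival rates, the mean service times and the second moments of the
    service times. Times are in seconds, rates in packets per second.
    """
    (lam1, lam2), (x1, x2) = lam, x_mean
    rho1, rho2 = lam1 * x1, lam2 * x2            # per-class utilization factors
    if rho1 + rho2 >= 1.0:                       # assumed form of the stability check
        raise ValueError("unstable queue: total utilization must stay below 1")

    # Mean residual service time, eq. (12)
    r_m = 0.5 * sum(l * m2 for l, m2 in zip(lam, x_sq_mean))
    w1 = r_m / (1.0 - rho1)                              # eq. (9)
    w2 = r_m / ((1.0 - rho1) * (1.0 - rho1 - rho2))      # eq. (11)

    t_premium = w1 + x1 + t_codec_pack                   # eq. (13)
    t_best_effort = w2 + x2 + t_codec_pack               # eq. (14)
    return {
        "W1": w1,
        "W2": w2,
        "T_premium": t_premium,
        "T_best_effort": t_best_effort,
        # First approximation: n_routers identical routers in series
        "E2E_premium": n_routers * t_premium,
        "E2E_best_effort": n_routers * t_best_effort,
    }
```

For instance, with fixed-size packets (so that the second moment equals the squared mean), calling `priority_queue_delays((10.0, 20.0), (0.01, 0.02), (0.0001, 0.0004))` gives utilization factors of 0.1 and 0.4 and a Best Effort waiting time twice that of the Premium class.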



5 Experimental Results and Performance Evaluation 

In order to evaluate the performance of the considered network, we ran a series
of tests using the three queuing algorithms mentioned above, and observed the
overall system performance in each setting. We increased the network congestion
level by using MGEN and TTCP to generate UDP and TCP disturbing traffic,
respectively. We then started a telephone connection between the two analog
phones, and observed the main QoS parameters under FIFO queuing, under PQ, and
under the two operational modes CQ-I and CQ-II described earlier. For each of
the following measurements, a G.711 codec is used for voice coding and decoding
at the Cisco 2600 routers.

Fig. 3 shows the packet loss experienced by Premium class packets crossing the
network, versus the network congestion level. As shown, the described queuing
strategies keep the voice packet loss at barely noticeable percentages, although
higher packet losses are observed for the FIFO and CQ-II strategies. Fig. 4
shows the packet loss for Best Effort class packets. Here, the packet loss
obviously increases with the network congestion level, and CQ-I clearly shows
the worst performance; this difference from the other queuing policies is due
to the large amount of resources that CQ-I reserves for voice packets, which
penalizes Best Effort class traffic flows.

Fig. 3. Packet Loss for Premium Class traffic

Fig. 4. Packet Loss for Best Effort Class traffic

We then focused on the total average end-to-end delay experienced by Premium
and Best Effort packets in the network, due to the router internal queues.
Fig. 5 shows the delay versus the network congestion level for Premium class
traffic, for each implemented queuing strategy. With any of the QoS management
methods, as opposed to FIFO, the average end-to-end delay remains under 50 ms,
a very acceptable value for such a network and, in general, for the VoIP
performance standards described earlier. Fig. 6 then shows the overall
end-to-end delay experienced by Best Effort class traffic, which reaches
unacceptable values when the network congestion level goes beyond 75%, which
is, as described in Sec. 4, the maximum throughput level for each router. In
this case the different queuing policies give similar results: although CQ-I
causes high packet losses, as shown in Fig. 4, once Best Effort packets are
accepted into the router they experience the same delay in the lower priority
internal queue under every queuing policy.

Fig. 5. Average end-to-end delay time for Premium Class traffic

Fig. 6. Average end-to-end delay time for Best Effort Class traffic

We then compare the overall average end-to-end delay calculated with the
analytical model described in Sec. 4 with the delay measured experimentally in
the test-bed. Fig. 7 shows the performance comparison for Premium class
traffic, while Fig. 8 deals with Best Effort class traffic. In both cases our
analytical model agrees well with the experimental measurements and, due to its
worst-case nature, it can be considered as a reference in feasibility analysis.

Fig. 8. Performance comparison for average end-to-end delay time for Best Effort Class
traffic with PQ

As for the IP packet delay variation (IPDV), which is the third main QoS
parameter that must be controlled for VoIP services, the analytical complexity
of deriving this term in closed form has led us to report only experimental
measurements here. Fig. 9, Fig. 10 and Fig. 11 show the measured delay
variation experienced by Premium class voice packets using, respectively, the
FIFO, PQ and CQ-I queuing strategies. In these figures, the reduction of the
standard deviation is clear evidence of the better performance of CQ-I, and
especially of PQ, in the considered network. CQ-II results are not reported
here because of its completely unsatisfactory performance with regard to IPDV.
PQ outperforms both the FIFO and CQ-I alternatives because, by always giving
priority to Premium class traffic, it appreciably reduces the overall delay
variation experienced by voice packets at the receiving router buffer. Thus, PQ
also seems to be the best solution for reducing and controlling this third
critical QoS parameter, allowing VoIP services to achieve good overall
performance.

Further extensions to the proposed architecture, in which more than one voice
call flow is treated as Premium class traffic, are presently under development
in the Alcatel labs in Florence.

Fig. 9. Delay variation received from Premium voice packets with FIFO queuing

Fig. 10. Delay variation received from Premium voice packets with PQ



Fig. 11. Delay variation received from Premium voice packets with CQ



6 Concluding Remarks

In this paper a Differentiated Services IP network with Voice over IP traffic
support, implemented in Alcatel Labs, Florence, Italy, has been considered.
Three queuing algorithms have been used in the router queues, namely FIFO,
Priority and Custom Queuing, in order to reserve a suitable amount of bandwidth
for Premium class voice over IP traffic, while letting Best Effort class
traffic operate without interruption of service and reserving for it the same
amount of bandwidth as in the FIFO queuing case. We have carried out
experimental measurements of the packet loss, average delay and delay variation
experienced by Premium and Best Effort class packets crossing the network,
while increasing the network congestion level through disturbing traffic
generators. Since Priority Queuing proved to be the best solution for the
considered network, we then developed an analytical estimate of the average
delay for a router using PQ, and compared the analytical predictions with the
experimental measurements, thus verifying the theoretical analysis.



Abstract. This paper aims at shedding some light on the concept of
Service Level Agreements (SLAs) and on their usefulness in the context
of the so-called Premium IP Networks. Such networks provide users with
a portfolio of services, thanks to their intrinsic capability to perform a
service creation process while relying on a QoS-enabled infrastructure.
We will introduce a definition of SLAs and we will then focus on an
actual example, namely the negotiation and management of QoS-aware
Virtual Private Networks (VPNs). We believe that VPNs are a significant
application due both to their importance in corporate scenarios and to
the high revenues they guarantee to service providers. We will discuss
the issues related to the effective SLA-based management of resources
in those cases where the need arises for an entity that is capable of
optimizing resource utilization in the presence of network infrastructures
shared by a community of users. Finally, a novel component, named SLA
Manager, that accomplishes these tasks, will be presented.



1 Introduction 

In recent years QoS has been one of the major research topics in the networking
community. First in academia, then in industry, the issues related to the
provision of performance guarantees for communication services have been the
subject of an intense debate that is still continuing in various fora (for
example, the IETF).

It should be noted, however, that most of this activity has focused
on the technological aspects of the assurance of QoS within network elements,
nodes and terminals. Most of the work has indeed been performed in the area of
the mechanisms and architectures required to ensure that, in packet-switched
networks, data belonging to certain flows can be differentiated from others. This
analytic approach has led to the definition of basic mechanisms and standards to
be adopted within and across network elements. We might define such results as
the basic elements for the provisioning of QoS “at a low level”, or micro-QoS.

In our opinion, however, this path towards technological development is missing
a critical evolution factor, i.e. the definition of technologies for a system-wide
approach to the provision of QoS-aware communication services. These
technologies should be the ones responsible for the provisioning of QoS “at a
high level” across complex network infrastructures, with a philosophy oriented
towards real processes and business models. We will define them as technologies
for macro-QoS provisioning.

S. Palazzo (Ed.): IWDC 2001, LNCS 2170, pp. 2001.

(c) Springer-Verlag Berlin Heidelberg 2001

Currently, from a Network Operator's point of view, the task of creating a
QoS-aware communication infrastructure is reduced to simply assembling
new, advanced network components and adding this new infrastructure to the
existing ones, while trying to maintain and re-use the existing business models
and architectures. Manufacturers of network equipment continually introduce new
solutions and products that conform to the existing or proposed
standards and recommendations as far as micro-QoS issues are concerned, while at
the same time adopting specific, proprietary solutions for the provisioning
of macro-QoS functionality.

The aim of this paper is to bring theoretical and practical contributions 
exactly in this area, with the goal of allowing the definition of tools, procedures 
and processes for the offering of advanced communication services in Premium 
IP networks across global infrastructures which might have a high degree of 
complexity in terms not only of scale, but also of the number of operators and 
level of technological heterogeneity. 

The paper is organized in six sections. The reference framework in which this
work is positioned is presented in section 2. Section 3 discusses the
main issues related to the definition, negotiation and activation of Service Level
Agreements in Premium IP Networks. An actual example of the applicability
of these new concepts is shown in section 4, where we expand on the issues
related to the negotiation and management of QoS-aware Virtual Private
Networks (VPNs). This will allow us to introduce a model for a novel network
component, named the SLA Manager (SLAM), whose main objective is the dynamic,
SLA-based management of corporate traffic (section 5). Finally, section 6
provides some concluding remarks, together with some information concerning
our future work in this field.



2 Reference Framework 

This section introduces the general architecture proposed for the dynamic
creation and provisioning of QoS-based communication services on top of Premium
IP networks. Such an architecture includes key functional blocks at the
user-provider interface, within the service provider domain and between the
service provider and the network provider. The combined role of these blocks is
to manage the user’s access to the service, to present the portfolio of
available services and to appropriately configure and manage the QoS-aware
network elements available in the underlying network infrastructure. Their
internal operations comprise activities such as authentication, aggregation and
a mediation procedure that includes the mapping of user-requested QoS to the
appropriate service/network resources, taking into account existing business
processes.

In our view, network architectures are expected to be highly heterogeneous
in terms of variety of systems and nodes, owing to the fact that they should
be able to support dynamic service creation and service configuration on top of
generic QoS-aware IP networks. Closely related to this activity is the
management of those resources in the underlying networks that are reserved at
registration/subscription, as well as those that are used - and maybe
subsequently modified - when the service is invoked/configured. Associated with
the reservation and usage of resources is the automated production and
presentation of the corresponding SLAs to the user and the translation from the
SLA to the corresponding Service Level Specification(s) (SLS) [2].

To this purpose, we have introduced three major components (figure 1)
that we believe are needed to supervise the dynamic service creation and service
configuration process:

— Access Mediator (AM); 

— Service Mediator (SM); 

— Resource Mediator (RM). 



Fig. 1. The reference framework for Premium IP Networks.



The Access Mediator is the device into which users input their requests to
the system. It adds value for the user by presenting a wider selection
of services, ensuring the lowest cost, and offering a harmonised interface
through which the currently available services are presented. The source
of the services is a so-called “Service Directory” database, but the Access
Mediator processes the raw information. For example, it can select the
cheapest offer if a movie is available from more than one service provider, and it
can notify the user as soon as a new movie becomes available that matches the
stored user’s profile. Its main role thus consists in assisting and easing the
service selection process. This functionality may be under the control of a
trusted third party and appears to offer excellent novel opportunities for a
value-added service provider.

The usage of a service generally involves two business processes: registration
as a user of the service, and invocation of the service at the moment
when it is used (any modification of the service parameters during a session can
be considered as a new invocation). The following sequence of events is broadly
applicable to both processes:

— after authentication, the user requirements are captured, and the Access 
Mediator sends the information to Service Mediators (which in turn employ 
the Resource Mediators) to map the requested and (subsequently) selected 
service into the deployed physical network; 

— once the service selection has been agreed with all parties, the SLA is 
“signed” between the user and the Service Mediator; 

— records of usage and the associated SLAs are stored in the Access Mediator 
for future reference. 
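The sequence of events above can be sketched in code. This is a minimal illustration under stated assumptions: the class and method names (`AccessMediator.handle`, `ServiceMediator.offers_for`, `ResourceMediator.map_service`), the keyword-based matching of requirements and the password check are all hypothetical stand-ins, not interfaces defined by the framework.

```python
from dataclasses import dataclass


@dataclass
class SLA:
    user: str
    service: str
    signed: bool = False


class ResourceMediator:
    def map_service(self, service):
        # Stand-in for mapping the selected service onto the physical network.
        return {"service": service, "mapped": True}


class ServiceMediator:
    def __init__(self, services, resource_mediator):
        self.services = list(services)
        self.resource_mediator = resource_mediator

    def offers_for(self, requirement):
        # Hypothetical matching rule: a service qualifies if its name
        # contains the requested keyword.
        return [s for s in self.services if requirement in s]

    def prepare_sla(self, user, service):
        # The Service Mediator employs the Resource Mediator for the mapping,
        # then prepares the SLA for the user to sign.
        self.resource_mediator.map_service(service)
        return SLA(user=user, service=service)


class AccessMediator:
    def __init__(self, service_mediators, credentials):
        self.service_mediators = list(service_mediators)
        self.credentials = dict(credentials)
        self.records = []                      # usage records and signed SLAs

    def handle(self, user, secret, requirement, choose):
        # Step 1: authentication, then capture of the user requirement.
        if self.credentials.get(user) != secret:
            raise PermissionError("authentication failed")
        for mediator in self.service_mediators:
            offers = mediator.offers_for(requirement)
            if offers:
                # Step 2: the user selects a service and the SLA is "signed".
                sla = mediator.prepare_sla(user, choose(offers))
                sla.signed = True
                # Step 3: the record is stored for future reference.
                self.records.append(sla)
                return sla
        return None
```

A registration and an invocation would both go through `handle`; in this sketch the `choose` callback plays the role of the user picking one of the presented offers.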

A graphical user interface associated with the Access Mediator is expected 
to provide a harmonised interface to the user for all the available service offers. 

The Access Mediator may form associations with one or more Service Medi- 
ators to which requests are issued. Generally off-line, the Service Mediator will 
supervise the incorporation of new services, their presentation in the “Service 
Directory” and the management of the physical access to these services via the 
appropriate underlying network, using the Resource Mediator(s). It is the task 
of the Service Mediator to prepare the SLA for the user to sign, and subse- 
quently map the SLA from the Access Mediator into the associated SLS(s) to 
be instantiated in cooperation with the Resource Mediator(s). 

The Service Mediator has an important role, as this is the place where ser- 
vices are created and from where the impacts of service reconfigurations are 
communicated to the network resource management. The Service Mediator has 
to inform the Access Mediators (usually via the Service Directory) of all new
service offerings, so as to allow them to present the updated portfolio to their
users. It
also has to check that the addition of a new service, or invocation of an existing 
one, will not affect the services that are currently operational. 

In this scenario, a policy based approach is a possible solution to ensure the 
correct operation of the network. Subsequent to the service creation, a policy 
extension could be applied to the network to ensure that all services can be 
managed correctly. The system would have a global view of the configuration 
of the devices (including an accounting system) and of the policy rules to be 
applied. In such a case, it would be the function of the Service Mediator to 
update the service level management system with new rules and configuration 
as required, in conjunction with the Resource Mediators. 

The communication between the Service Mediator and the Resource Me- 
diator should be generic (i.e. independent of the technology employed by the 
underlying network). According to our design, it is the Resource Mediator that
will hold the current end-to-end view of the network QoS, by communicating
with all the appropriate underlying network management systems. A network 
provider wishing to offer its resources should support an interface capable of 
handling messages defining an SLS, from its network management system to 
one or more Resource Mediator(s). The SLS templates we envision are in line 
with the descriptions in |‘2iS| . 

Since there can be more than one Resource Mediator, a Service Mediator
can issue identical requests for information about network resource
availability to several Resource Mediators. The Resource Mediators will either
act on their own image of the network, or explicitly query the individual
network management systems, before returning an answer to the Service Mediator.
The Service Mediator will accept the best offer, on the basis of the current
policy decisions.
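As an illustration of this request-and-select step, the sketch below sends the same bandwidth request to several Resource Mediators and accepts the offer that minimizes a policy-defined cost. The `Offer` fields, the helper names and the "cheapest offer wins" policy are assumptions made for the example; the text does not prescribe a particular policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    mediator: str
    bandwidth_kbps: int
    price: float


def make_mediator(name, free_kbps, price_per_kbps):
    """Build a toy Resource Mediator that answers from its own network image."""
    def ask(request_kbps):
        if request_kbps > free_kbps:
            return None                       # no resources: no offer returned
        return Offer(name, request_kbps, request_kbps * price_per_kbps)
    return ask


def collect_offers(request_kbps, mediators):
    """Issue the identical request to every Resource Mediator."""
    offers = []
    for ask in mediators:
        offer = ask(request_kbps)
        if offer is not None:
            offers.append(offer)
    return offers


def accept_best(offers, policy_cost):
    """Accept the offer with the lowest cost under the current policy."""
    return min(offers, key=policy_cost) if offers else None
```

With mediators built as `make_mediator("rm-a", 1000, 0.02)`, `make_mediator("rm-b", 5000, 0.03)` and `make_mediator("rm-c", 4000, 0.025)`, a request for 2000 kbps is declined by `rm-a`, and `accept_best(..., policy_cost=lambda o: o.price)` picks the offer from `rm-c`.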

In order for the Resource Mediator to maintain and update its end-to-end 
network view of the current QoS availability, it may use a set of policy rules that 
are agreed with the underlying network management systems 0. 

A common feature of the communication process surrounding the Access 
Mediator, Service Mediator and Resource Mediator components is a “one-to- 
many, search-and-selection” mechanism. In particular: 

— the Access Mediator is responsible for selecting the appropriate Service Me- 
diator(s), according to the user’s request; 

— the Service Mediator is responsible for finding - and, in some cases, building 
from individual elements - the service, requesting information from (and 
then selecting) the appropriate Resource Mediator(s); 

— the Resource Mediators are responsible for selecting the appropriate network 
capabilities, given several available options. 

The potential for developing a common protocol for all of these similar actions 
is described in 0. 

3 Service Level Agreements 

In this section we address the issues related to the definition of the Service 
Level Agreements 0 suitable for the services envisaged in the framework of 
Premium IP (PIP) Networks. As we saw in the previous section, PIP networks 
are capable of delivering new services to the end-users. Such services are char- 
acterized by different levels of Quality of Service (QoS). In this scenario, the 
definition of a service creation framework p] plays a major role, given its aim 
to dynamically create application-level services by appropriately combining and 
configuring pre-existing service components. Based on these considerations, we 
will provide definitions for SLA and a number of scenarios describing their re- 
lationship with the service creation and resource management framework. This 
set of definitions is part of a comprehensive conceptual model (the SLA modeling
framework), which is extensible by nature, can include sections specific to
different technologies or service architectures, and covers SLA modeling for both
transport services and end-user services.

3.1 Service

A service is a concept which may be modeled from different perspectives. Whether 
acquiring services in a retail or whole-sale mode (see below) a service provision- 
ing/creation life cycle starts with a service description, which then requires a 
proper degree of formalization. Emphasis is on the different types of information 
that need to be modeled and on the needed ability to model relations between 
them. 

3.2 SLA 

An SLA is a contract between the customer and the provider of a specified 
service. Such a contract is signed upon subscription to the service itself. An SLA 
is prepared from templates specifically conceived for the available services. 

SLA templates are used during customer negotiation to define the required 
level of service quality. The production of an SLA template is an intrinsic part 
of service development. These SLA templates may either relate to standard 
product/service offerings, where they are used “as-is” to define the required level 
of service quality, or provide a baseline for custom negotiation (either automated 
or human-assisted). SLAs are defined on something perceived by the customer 
(i.e. explicitly subscribed to), that is the service elements composing the service 
product offering. 

3.3 Retail vs Whole-Sale SLA 

The retail SLA refers to the agreement between an end-user and a service
provider. The end-user might be either a single person or a user organisation
(e.g. a corporation or a public institution). Such an end-user could be induced to
establish an SLA with his provider in order to support different kinds of
applications. Some of the applications which we deem worth investigating are, for
example:

— Adaptive Multimedia Applications (e.g. Video on Demand, Video Confer- 
ence, etc.); 

— Voice over IP (VoIP); 

— Virtual Private Networks (VPNs). 

SLAs trigger the negotiation of hierarchical agreements between different 
contractors. In the case of multi-domain scenarios, service providers may need 
to create inter-network agreements in order to support their end-user SLAs. 
We call these inter-provider contracts whole-sale SLAs. A whole-sale SLA takes
into account traffic aggregates flowing from one domain to another. In general,
there is no direct connection between retail SLAs (r-SLAs) and whole-sale SLAs
(w-SLAs). In particular, w-SLAs
might not be based on parameters related to a single service but might focus on
statistical indicators related to the Grade of Service of the entire bundle provided
by one provider to one of its neighbors. The focus of this paper is mainly on retail
SLAs.

3.4 Static vs Dynamic SLA

As already stated, an SLA is a contract between two parties. To date, the general 
trend has been to consider only static SLAs: the contracts are instantiated after 
negotiations by human agents and their terms cannot be modified during their 
lifetime. We do believe that dynamic features are needed in order to better match 
the requirements of real-world operational scenarios. We envision at least two 
different flavors of dynamic behavior: 

— time-varying user requirements (with different time-scales of variability
induced by specific application characteristics);

— time-varying network conditions (of which the user is made aware via feed- 
back signals raised by the network itself). 

Both flavors may induce the end-user to change over time the terms of a 
pre-established SLA. Typical usage scenarios are shown below. 

1. No time-varying user requirements. No time-varying network conditions:

in this situation the end-user establishes a static SLA, with no feedback.
This implies that, once the negotiation phase has terminated successfully, no
modifications to the contract are allowed. In case of an admission control
failure, the user is provided with no information concerning the reasons
behind it. Thus, he has no clues on how to better re-formulate his request.

2. Time-varying user requirements. Time-varying network conditions:

this case refers to the most complex possible scenario, which requires
re-negotiable SLAs. During the negotiation and usage phases, the network may
provide the user with useful information for tuning his request. The contract
may be re-negotiated at any time.

3. No time-varying user requirements. Time-varying network conditions:

in this scenario the network is capable of keeping users informed about its
state, even if the users themselves cannot re-negotiate contracts on the fly.
To exploit network hints they are compelled to tear down pre-existing
contracts and re-formulate their requests from scratch.

4. Time-varying user requirements. No time-varying network conditions:

here, users are free to change their contracts according to their new
requirements. Obviously, in this case users’ requests may incur admission
control failures, due to the absence of specific data concerning the current
state of network resources.
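The four scenarios are the combinations of two independent yes/no properties, which the following sketch makes explicit. The enumeration member names and the two helper functions are hypothetical labels chosen for this example; only the numbering follows the list above.

```python
from enum import Enum


class SlaScenario(Enum):
    STATIC_NO_FEEDBACK = 1       # scenario 1: static SLA, no feedback
    RENEGOTIABLE = 2             # scenario 2: re-negotiable SLA with feedback
    FEEDBACK_ONLY = 3            # scenario 3: feedback, but tear down to change
    BLIND_RENEGOTIATION = 4      # scenario 4: changes allowed, no feedback


def classify(user_varying, network_varying):
    """Map the two dynamic properties onto the four usage scenarios."""
    if user_varying and network_varying:
        return SlaScenario.RENEGOTIABLE
    if network_varying:
        return SlaScenario.FEEDBACK_ONLY
    if user_varying:
        return SlaScenario.BLIND_RENEGOTIATION
    return SlaScenario.STATIC_NO_FEEDBACK


def can_change_contract_in_place(scenario):
    """True when the user may modify the contract without tearing it down."""
    return scenario in (SlaScenario.RENEGOTIABLE, SlaScenario.BLIND_RENEGOTIATION)


def receives_network_feedback(scenario):
    """True when the network keeps the user informed about its state."""
    return scenario in (SlaScenario.RENEGOTIABLE, SlaScenario.FEEDBACK_ONLY)
```

Read this way, scenario 4 is the case where `can_change_contract_in_place` holds but `receives_network_feedback` does not, which is why its requests may run into admission control failures.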



Figure 2 summarizes the aforementioned scenarios.

Fig. 2. Dynamic SLAs and related applications.



3.5 Content-Based SLAs 

The service creation framework we depicted in the previous section envisages a 
scenario where users contact an Access Mediator (AM) in order to gain access 
to a number of value-added services, by means of negotiation of specific Ser- 
vice Level Agreements. The AM, in turn, needs to interact with one or more 
Service Mediators, each providing a certain set of services, to retrieve
information about the characteristics of the services themselves. Afterwards, it
organizes this information in order to let the user choose the service that most
appropriately fits his needs. Once a specific service has been chosen, the
involved Service Mediator(s) is (are) in charge of interacting with one or more
Resource Mediators which, eventually, configure network elements so as to
efficiently satisfy the negotiated requests.

The process described foresees the generation of a number of documents 
(Service Level Agreement, Service Level Specification, policy rules), each de- 
scribing the same instance of the service at a different level of abstraction and 
thus requiring creation/interpretation by the modules (Access Mediator, Service 
Mediator, Resource Mediator) belonging to the corresponding level of the overall 
architecture. 

Digging into the details of such mechanisms, we can see in the accompanying
figure that the Service Level Agreement is a contract between the end-user and
the Service Mediator, negotiated via mediation of the Access Mediator. Once this
contract has been signed, the Service Mediator is in charge of translating it
into an appropriate Service Level Specification, containing a technical
description of the service itself. This translation is a uni-directional
process, requiring some additional information on the SM’s side in order to
retrieve, where necessary, service-specific data.




The SLS is in turn given to the Resource Mediator, which translates it into 
a format that is the most appropriate for the QoS-capable network it manages. 
For example, it might build a list of policy rules, needed inside Policy Decision 
Points (PDP) in order to configure the underlying network elements (or Policy 
Enforcement Points - PEPs) via a policy protocol like COPS.
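The two translation steps can be sketched as plain data transformations. Everything concrete in this example is an assumption made for illustration: the catalogue entries and their values, the field names, and the condition/action rule format, which is a generic stand-in and not the message format of COPS or of any real policy server.

```python
# Hypothetical catalogue kept on the Service Mediator side: the additional,
# service-specific data needed to turn an SLA into an SLS.
CATALOGUE = {
    "premium-voice": {
        "rate_kbps": 64, "max_delay_ms": 150, "traffic_class": "premium",
    },
    "best-effort-data": {
        "rate_kbps": 512, "max_delay_ms": None, "traffic_class": "best-effort",
    },
}


def sla_to_sls(sla, catalogue=CATALOGUE):
    """Service Mediator step: one-way translation of an SLA into an SLS."""
    technical = catalogue[sla["service"]]
    return {"src": sla["src_site"], "dst": sla["dst_site"], **technical}


def sls_to_policy_rules(sls):
    """Resource Mediator step: express the SLS as condition/action rules."""
    condition = {"src": sls["src"], "dst": sls["dst"]}
    rules = [
        {"if": condition, "then": ("classify", sls["traffic_class"])},
        {"if": condition, "then": ("limit_rate_kbps", sls["rate_kbps"])},
    ]
    if sls["max_delay_ms"] is not None:
        rules.append({"if": condition, "then": ("bound_delay_ms", sls["max_delay_ms"])})
    return rules
```

Note that the SLA itself carries only what the customer subscribed to (here a service name and two sites); the technical parameters appear only after the catalogue lookup, which mirrors the one-way nature of the translation.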

3.6 Issues Related to the Definition of Service Level Agreements 

As a general remark, an SLA should give the user the possibility to negotiate a
certain type of service, among those offered by the network operator. We expect
that most users will simply not know the details of the service they expect from
the network (especially those concerning the traffic characterization), either
because such information is not available at all, or because they lack the
motivation or the technical skills required to understand their semantics. To
ease the process of filling in the contract template, a number of different SLA
models might prove useful: the contract would become easier to understand, being
focused on the actual needs expressed by the user. These SLAs may be considered
as formed by two different parts, one containing information that does not
depend on the particular application and the other containing
application-specific data. The first might, for example, include:

— service level;

— user authentication module; 

— information concerning availability/reliability of the service; 

— encryption services; 

— pricing and billing policies; 

— options (enabling, for example, contract re-negotiation in case of
unavailability of the required resources).

The second part of the agreement is analyzed in further detail in the next 
sub-section, devoted to examples on possible operational scenarios. 
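A two-part agreement of this kind could be represented as follows. This is only a sketch: the class names, field types and the `renegotiate_if_unavailable` option key are hypothetical choices for the example, with the generic fields taken from the list above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class GenericTerms:
    """Application-independent part of the agreement."""
    service_level: str
    authentication: str
    availability: str
    encryption: bool
    pricing: str
    options: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SlaDocument:
    """An SLA as two parts: generic terms plus application-specific data."""
    generic: GenericTerms
    application: Dict[str, Any] = field(default_factory=dict)

    def allows_renegotiation(self) -> bool:
        # Hypothetical option key for re-negotiation on resource unavailability.
        return bool(self.generic.options.get("renegotiate_if_unavailable", False))
```

Keeping the application-specific part as a free-form mapping reflects the idea that different SLA models share the first part and differ only in the second.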

What we want to point out here is that, from the network perspective, the
need arises to unambiguously specify all of the details and characteristics of
the service the user is willing to receive. SLAs should thus be translated into
related Service Level Specifications (SLSs) containing all of the technicalities
associated with the service itself. SLSs should be independent of both the
high-level applications they stem from and the low-level network infrastructures
on which they operate. Work is in full swing in the Internet community to define
the main aspects related to SLS definition.

3.7 Example Operational Scenarios for SLAs

Interactive Multimedia Applications

Such applications include audio/video transmissions where a user connects
to a video-server archive containing a number of movies that can be streamed
to a client host. In the same category we can also put those
applications, like Video Conference and Telemedicine, where video and audio
data are generated from live sessions. For these applications, a mechanism is
required to grant access to either the movie list or the session directory, in order
to let the user choose the file/event he is interested in. The user is required to
indicate the service he wishes to receive, optionally specifying the service
lifetime. After these parameters have been defined, the translation module has
to retrieve the traffic characterization associated with the specified
files/sessions, in order to insert it into the network SLS.



Virtual Private Networks 

In this section we consider only the issues involved in creating an SLA for 
a VPN service that relate to the provision of Quality of Service guarantees. 
We therefore do not address security or fault tolerance. We envision a 
scenario where a company, or in general an organization based on multiple 
facilities or sites, asks for the provisioning of a Virtual Private Network 
service as a way to efficiently interconnect its networking infrastructures. 
We expect, therefore, to see the VPN service forward traffic generated by a 
variety of users, network infrastructures, services, and applications. For 
example, we might consider a situation where two or more sites of a company 
are connected via a VPN service so as to have both data-like connectivity 
(LAN-to-LAN) and voice-like connectivity (VoIP interconnections). The SLA 
will therefore cover the provision of services to a mix of traffic with 
different requirements in terms of bandwidth and QoS. 

In the case of VPNs, the r-SLA negotiated with the Service Mediator might 
not be fine-grained enough to accommodate the internal needs of the company 
or institution for which the VPN is being set up. Thus, the customer may 
apply further traffic management on its own premises. We will focus on these 
issues in the next two sections: more precisely, in section 4 we will propose 
an SLA template for VPNs, while in section 5 we will show how effective 
management of the VPN resources may be achieved thanks to the introduction of 
a novel component, called SLA Manager (SLAM). 

4 SLAs for Virtual Private Networks 

While designing a template for a Service Level Agreement for VPNs, we kept in 
mind the fact that users are looking for simple solutions to complex problems: 
that is to say, when buying an enhanced VPN service, they would just like 
to express their needs at a high level of abstraction, with no need to bother 
with all the underlying technicalities. Thus, we exploited the capabilities of the 
aforementioned mediation components in order to fill the semantic gap between 
the user’s and the provider’s perspectives on the same service. 




Fig. 4. The “Drag&Drop” VPN. 



We provide the user with a friendly graphical interface (figure 4) enabling him 
to design his own private network by simply drawing a graph representing the 
sites he wishes to interconnect, together with the related tunnels (represented 
by the edges in the graph). Furthermore, for each tunnel, we envisaged the 
possibility to choose its specific features: 

— type: mono or bi-directional; 

— required bandwidth; 

— desired quality; 

— time schedule: permanent or time-dependent (either periodic or related to a 
specified time frame); 

— authentication procedure; 

— encryption algorithm. 

This graphical representation of the VPN is then passed, in the form of an 
incidence-like matrix, to the Access Mediator, which in turn builds a formal 
description (e.g. in a language like XML) of the agreement the user is willing to 
negotiate (figure 5). 
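
The hand-off from the drawn graph to the Access Mediator can be sketched as follows. The text fixes neither the layout of the incidence-like matrix nor the XML vocabulary, so both are assumptions here: rows are sites, columns are tunnels, and the element and attribute names (`sla`, `site`, `tunnel`, `bandwidth_kbps`) as well as the site names are purely illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical VPN drawn by the user: sites are nodes, tunnels are edges.
sites = ["Naples", "Rome", "Milan"]
tunnels = [
    # (source, destination, type, bandwidth in kbps, desired quality)
    ("Naples", "Rome", "bi-directional", 128, "low-delay"),
    ("Rome", "Milan", "mono-directional", 512, "best-effort"),
]

def incidence_matrix(sites, tunnels):
    """One row per site, one column per tunnel (assumed encoding):
    +1 marks the source, -1 the destination; a bi-directional tunnel
    is marked +1 at both ends."""
    index = {name: i for i, name in enumerate(sites)}
    matrix = [[0] * len(tunnels) for _ in sites]
    for col, (src, dst, kind, _bw, _quality) in enumerate(tunnels):
        matrix[index[src]][col] = 1
        matrix[index[dst]][col] = 1 if kind == "bi-directional" else -1
    return matrix

def to_xml(sites, tunnels):
    """Build a formal description of the requested agreement (illustrative schema)."""
    root = ET.Element("sla", service="vpn")
    for name in sites:
        ET.SubElement(root, "site", name=name)
    for src, dst, kind, bw, quality in tunnels:
        ET.SubElement(root, "tunnel", source=src, destination=dst, type=kind,
                      bandwidth_kbps=str(bw), quality=quality)
    return ET.tostring(root, encoding="unicode")
```

The remaining tunnel features listed above (time schedule, authentication procedure, encryption algorithm) would become further attributes or child elements in the same way.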




Fig. 5. User to AM communication of the VPN features. 



This template SLA is the basis of the negotiation process, which involves 
all of the mediators and was described in the previous sections. In 
particular, the Service Mediator, upon reception of this information, is in 
charge of performing the following tasks: 

— retrieve information about links; 

— for each link: 

• create an SLS, based on the Virtual Leased Line concept [2]; 
• compute the cost associated with this VLL, by contacting the involved 
Resource Mediators (this might require “SLS/policy rules” translation); 

— compute overall cost (e.g. sum of all link costs); 

— send cost back to the AM. 
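
The task list above can be sketched in a few lines. This is only an illustration under stated assumptions: `VllSls`, `ResourceMediator`, its flat per-kbps tariff and the `mediators_for` lookup are hypothetical names introduced here, not part of the architecture described in the text.

```python
from dataclasses import dataclass

@dataclass
class VllSls:
    """Simplified SLS for one Virtual Leased Line (fields are illustrative)."""
    ingress: str
    egress: str
    rate_kbps: int

class ResourceMediator:
    """Stand-in for a Resource Mediator with a hypothetical flat tariff."""
    def __init__(self, price_per_kbps):
        self.price_per_kbps = price_per_kbps

    def quote(self, sls):
        return sls.rate_kbps * self.price_per_kbps

def price_vpn(links, mediators_for):
    """links: iterable of (ingress, egress, rate_kbps).
    mediators_for: callable returning the Resource Mediators involved in an SLS."""
    per_link = []
    for ingress, egress, rate in links:
        sls = VllSls(ingress, egress, rate)                       # one SLS per link
        cost = sum(rm.quote(sls) for rm in mediators_for(sls))    # ask each mediator
        per_link.append((sls, cost))
    total = sum(cost for _, cost in per_link)                     # overall cost
    return per_link, total
```

The returned total is what the Service Mediator would send back to the AM; summing link costs is the example aggregation named in the list, and other pricing rules could replace it.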

Figure 6 gives an idea of what the SLS for a single Virtual Leased Line might 
look like. 



-Scope: one-to-one (143.225.229.254,143.225.170.254) 

-Flow description: (143.225.229.0,143.225.170.0) DSCP=EF 

-Traffic Conditioning: token bucket (b,r): r=128kbps 
-Excess Treatment: dropping: only in-profile packets allowed 
-Delay guarantee: qualitative (e.g. delay="low") 

-Loss guarantee: p ≈ 0 (implying a throughput guarantee R ≈ r) 
-Service Schedule: Weekly, 8am on Tuesday to 8pm on Wednesday 
-Reliability: May be specified 

-Options: Authentication (from CA), 3DES encryption 



Fig. 6. SLS for a Virtual Leased Line of the VPN. 
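
The traffic conditioning and excess treatment entries of this SLS (token bucket (b, r), with out-of-profile packets dropped) can be illustrated with a minimal policer. The SLS gives r = 128 kbps but leaves b unspecified, so the usage below assumes a depth of 1500 bytes; the class and method names are likewise illustrative.

```python
class TokenBucket:
    """Token bucket (b, r): depth b in bytes, rate r in bytes per second."""
    def __init__(self, depth_bytes, rate_bytes_per_s):
        self.depth = depth_bytes
        self.rate = rate_bytes_per_s
        self.tokens = depth_bytes   # the bucket starts full
        self.last = 0.0             # time of the previous arrival, in seconds

    def conforms(self, now, packet_bytes):
        """Return True (forward) for an in-profile packet, False (drop) otherwise.
        Arrival times must be non-decreasing."""
        elapsed = now - self.last
        self.last = now
        # Refill at rate r, never beyond the bucket depth b.
        self.tokens = min(self.depth, self.tokens + elapsed * self.rate)
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False

# r = 128 kbps = 16000 bytes/s; b = 1500 bytes is an assumed value.
policer = TokenBucket(depth_bytes=1500, rate_bytes_per_s=16000)
```

A burst larger than b is dropped even if the long-term rate stays below r, which is why the bucket depth has to be agreed on together with the rate.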



After retrieving the overall service costs, the AM is able to build the 
candidate SLAs and send them to the user. Once the user has chosen a specific 
offer, the AM can store the associated SLA inside the user’s repository and 
notify the SMs about the final decision, so as to let them configure network 
resources appropriately. 

It should be noted that in this scenario one SLA corresponds to multiple 
SLSs, and that the concept of a VPN exists only at the SLA level of 
abstraction: the only component that speaks “Dragged&Dropped VPNs” is the AM. 
Furthermore, the SLSs are both service-independent and network-independent. 

5 A Model for an SLA Manager 

One of the issues to be considered in SLA-based Premium IP networks is the 
possible complexity of the interactions between the users and the network 
entities responsible for the presentation and negotiation of SLAs. An SLA is a 
complex set of data, pertaining to different aspects of the provision of a 
network service or application: as already mentioned, in the case of a video 
delivery service, an SLA can include content-dependent elements. In this 
business scenario, the user is simply requested to accept the purchase of a 
service (for example, the delivery of a movie or of a multimedia document) and 
to ask the network (via the Access Mediator) for the provision of a 
communication service suited to the performance requirements of that specific 
content or service. Even though novel services could introduce new, more 
complex business scenarios, we expect that this kind of interaction between an 
end-user and the Access Mediator will remain quite simple, and perfectly 
manageable directly by the final user himself, for example via a web-based 
interface. 

However, we believe that there are cases where such interactions could 
become much more complex, due to the presence of multiple technical and 
commercial aspects related to the nature of the offered service and to the 
specific needs of what we define as the end-user. This is the case, for 
example, of the provision of a VPN service for the interconnection of multiple 
network infrastructures via public Premium IP networks. 

Virtual Private Networks are being considered as the real killer applications 
for future networks. From the network provider’s standpoint this is primarily 
due to the complexity of their setup and management. Users, on the other hand, 
are mostly interested in the possibility of exploiting an infrastructure capable 
of providing services with guaranteed QoS. 

In a QoS-capable network, a VPN service will be based on the idea of pro- 
visioning, on top of purely virtual links, a tailored communication service that 
could be differentiated from existing ones in terms of: 

— performance and QoS; 

— monitoring and accounting; 

— security and privacy. 

As for the first issue, we expect that a VPN should offer a bouquet of 
services corresponding to the different QoS requirements that the data 
flowing across the VPN itself might demand, according to policy decisions that 
are, and should remain, free to be defined and set by the user. Indeed, 
since VPNs are more and more linked to the need to create and operate 
the so-called “network companies”, one of the major issues will be dynamic 
management of the virtual communication infrastructure. 

We introduced two differently-grained SLA types: (retail) r-SLA and 
(wholesale) w-SLA. The former is negotiated between an end-user and a service 
provider on a relatively small time scale, generally involving small amounts of 
resources. The latter, instead, is negotiated between two different network 
domains less frequently than an r-SLA, in order to create a pipe which merges 
the different flows relating to r-SLAs already instantiated. 

In the case of a users’ organization, for example, a representative of the 
organization signs an SLA with the access provider, reserving a large amount of 
resources. Individual members of the organization can then apply for a smaller 
part of these resources, up to the bulk available. Further requests have to be 
declined, or a new contract has to be (re)negotiated by the organization. 

Fig. 7. The Service Level Agreement Manager. 



In such a scenario the necessity arises for an entity that manages bulk 
resources, assigns sub-portions of them and is capable, in case of need, of 
(re)negotiating larger quantities of resources. It can be considered as a 
mediator between a single user and a larger generic service provider (reached 
via an Access Mediator). This entity is in charge of negotiating with the 
Service Mediator a retail SLA that applies to the users’ organization as a 
whole. We call this entity the SLA Manager (SLAM) (figure 7). A SLAM should 
have the following major functions: 

— towards the single users: 

• AAA (Authentication, Authorization & Accounting); 

• internal negotiation of the r-SLA, in the form of what we call “r^-SLA” 
(figure 8); 

• providing a friendly GUI; 

• enabling a user preference profile to be “bookmarked” for future use. 

— towards the Access Mediator: 

• (re)negotiation of the r-SLA. 

— from a more global perspective: 

• policing and/or shaping of single user’s flows; 

• admission-control; 

• providing a fair sharing of total resources; 

• triggering local network configuration. 
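
The admission-control and (re)negotiation functions listed above could be sketched as follows. This is a minimal illustration that treats bandwidth as the only resource; the class name `SlaManager` and the `renegotiate` hook towards the Access Mediator are hypothetical and do not reflect the prototype's actual interface.

```python
class SlaManager:
    """Shares the bandwidth of one bulk r-SLA among the members of an organization."""
    def __init__(self, bulk_kbps, renegotiate=None):
        self.bulk_kbps = bulk_kbps
        self.allocated = {}              # user -> kbps currently granted
        self.renegotiate = renegotiate   # hypothetical hook towards the Access Mediator

    def available(self):
        return self.bulk_kbps - sum(self.allocated.values())

    def request(self, user, kbps):
        """Admission control: grant the request only if it fits in the bulk."""
        shortfall = kbps - self.available()
        if shortfall > 0 and self.renegotiate is not None:
            # Ask for a larger r-SLA; the hook returns the extra kbps obtained.
            self.bulk_kbps += self.renegotiate(shortfall)
        if kbps > self.available():
            return False                 # request declined
        self.allocated[user] = self.allocated.get(user, 0) + kbps
        return True

    def release(self, user):
        """Return a user's share to the bulk."""
        self.allocated.pop(user, None)
```

Without a renegotiation hook, requests beyond the bulk are simply declined, which matches the behaviour described for single members of an organization; fair sharing and policing would need per-user limits on top of this.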

The SLAM is therefore a new entity, placed between an existing Access Me- 
diator and the end-user, but surely pertaining to the user’s domain. 

Fig. 8. The role of the SLAM. 



We are currently prototyping a SLAM entity for VPNs, based on a model 
where QoS is taken into account at different levels of granularity inside the 
VPN tunnels. More precisely, a small and fixed number of traffic classes are 
considered in the core of the network (as delimited by the LAN gateways) and 
share the available bandwidth in a controlled way (exploiting, for example, a 
Priority Queuing algorithm). Inside each class, a further level of 
discrimination is applied, by identifying the individual micro-flows and giving 
each of them ad hoc treatment via an additional scheduler (e.g. Weighted Fair 
Queuing). Figure 9 gives an idea of how this scenario is realized. 



Fig. 9. Scheduling flows inside VPN tunnels 



658 

M. D’Arienzo et al. 



We implemented such scheduling algorithms on a programmable networks 
platform, where they may be combined in a structured fashion, thus yielding 
complex configurations, ranging from simple series/parallel structures to 
nested ones. The situation depicted in the figure might relate, for example, to 
the case where all IP telephony flows are given priority over those generated 
by other applications (as guaranteed by the external scheduler) and are further 
differentiated from one another by appropriately configuring the internal WFQ 
scheduler (which acts as a “gate keeper” element). In this case, the SLAM is 
responsible for negotiating the r-SLA related to the bulk resources assigned to 
the VPN tunnel, while re-distributing them in a controlled fashion in the inner 
local network. 
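
The nested configuration described here, an external priority scheduler with a fair-queuing scheduler inside each class, can be sketched as below. Exact WFQ needs to track the virtual time of a fluid reference system, so this sketch swaps in self-clocked fair queuing, a common approximation in which the virtual clock follows the finish tag of the packet in service. Class names, flow names and weights are all illustrative and say nothing about the programmable platform itself.

```python
import heapq
from itertools import count

class FairQueue:
    """Self-clocked fair queuing, used here as an approximation of WFQ."""
    def __init__(self, weights):
        self.weights = weights            # flow id -> weight
        self.last_finish = {}             # flow id -> finish tag of its latest packet
        self.virtual_time = 0.0
        self.heap = []
        self.order = count()              # tie-breaker that keeps arrival order

    def enqueue(self, flow, size):
        start = max(self.virtual_time, self.last_finish.get(flow, 0.0))
        finish = start + size / self.weights[flow]
        self.last_finish[flow] = finish
        heapq.heappush(self.heap, (finish, next(self.order), flow, size))

    def dequeue(self):
        finish, _, flow, size = heapq.heappop(self.heap)
        self.virtual_time = finish        # the clock follows the packet in service
        return flow, size

    def __len__(self):
        return len(self.heap)

class PriorityScheduler:
    """External strict-priority scheduler over per-class fair queues."""
    def __init__(self, classes):
        self.classes = classes            # list of (name, FairQueue), highest priority first

    def enqueue(self, cls, flow, size):
        dict(self.classes)[cls].enqueue(flow, size)

    def dequeue(self):
        for name, queue in self.classes:
            if len(queue):
                return (name,) + queue.dequeue()
        return None                       # nothing to send
```

With a voice class placed first and a data class second, every queued voice packet leaves before any data packet, while the voice flows share their class according to their weights.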

6 Conclusions and Future Work 

In this paper we presented a novel approach to the design and development of a 
global architecture for the effective deployment of value-added Internet services 
upon Premium IP networks. As we saw, this is an ambitious goal, requiring 
a comprehensive understanding of all the procedures involved, from 
user-to-network interaction, through a formal description of the service, all 
the way to appropriate configuration of network devices. Based on these 
considerations, we pointed out the need for a thorough definition of the 
concept of Service Level Agreements and associated Service Level 
Specifications. We then presented a model based on a modular decomposition of 
the tasks involved in the deployment process, making the most of the concepts 
of “mediation” and of recursive group communication. Finally, to give an idea 
of how these concepts apply in a real-world operational scenario, we focused on 
an actual example related to dynamic, SLA-based management of Virtual Private 
Networks. 

This is the research field we are most interested in investigating further, 
because of the challenging topics it raises. We are firmly convinced 
that the approach we presented may prove extremely useful for next generation 
Value Added Service Providers (VASP). 



Abstract. The number of Internet and IP-capable devices and networks will 
increase dramatically in the near future. In order to exploit their potential in 
multimedia communications, we need a service platform that can handle the 
management and control of the multimedia streams independently of the access 
networks. In this paper we introduce a distributed service platform that is based 
on CORBA and we discuss user and device management in such a system in 
more detail. In addition, we try to identify the problems, design 
issues and possible ideas linked to this kind of system, so that we can use this 
information later on while designing similar horizontal service platform 
solutions for commercial use. 



1 Introduction 

Traditionally each access network has its own set of services, and only lately have 
we seen attempts to combine the service offering from various access networks; e.g. 
WAP (or I-mode) devices are able to use Internet services in addition to the 
operator's WAP services. The ultimate goal might be that the access network doesn’t 
limit the number and type of available services. On the other hand, some of the 
services on the fixed network side are not interesting for mobile users and vice 
versa. Yet, it might still be fruitful to combine things related to the user, devices, 
billing, etc. on the management side. Our approach assumes that all services are 
accessible from every access network. This doesn’t mean, however, that we should be 
able to watch movies on a GSM phone. The idea of the distributed service system 
discussed in this paper is that a device from any access network could be used to 
order and control stream services, while the processing of the stream can be handled on 
some other device such as a PC or a future IP-TV. This approach adapts quite easily to 
current service platforms, because many access networks already take care of user 
authentication and billing; a GSM user can be automatically authenticated and 
charged, while an ordinary PC user cannot necessarily be. 

Generally, bit rates are increasing, but they are not yet at the level required to 
use broadband (from approx. 100 kbit/s to several Mbit/s) stream services 
from any access network. Only certain fixed networks and WLAN connections are 
already able to provide this type of multimedia service. However, bit rates are not 
the only thing slowing down the evolution of multimedia services; the pricing models 
are not up to date, and currently high service prices are preventing the usage of 
broadband multimedia services. On the other hand, the lack of services might also 
have something to do with this. The pricing models should be able to handle 
multimedia services properly, e.g. the users should be charged only for the services 
they use, not for the control traffic they have to generate while making the order. 
This kind of service-specific billing means that the operators must co-operate to 
enable these services and agree on the service expenses, e.g. roaming, without 
additional payment for the user. The operator expenses could be covered in the 
service payment, but for the user this means a clear and practical way to order 
services, compared to the current model. 

From the operator’s point of view it is evident that the number of different access 
networks and also the number of users will increase over time. This creates general 
management problems, and some day the current vertical service solutions, where each 
access network has its own set of services, will no longer scale for the user’s and 
operator’s purposes. The horizontal solution, which is independent of access networks 
in the application layer, gives us the possibility to manage users and services 
efficiently. The user identity and profile information, and also the view of the service 
system, stay the same independently of the access mechanism. The All-IP solution, which 
is also the ultimate goal for the Universal Mobile Telecommunications System 
(UMTS) according to 3GPP’s vision [1], enables this kind of horizontal service 
concept. Along with the All-IP solution we will eventually get IPv6 [2] on all mobile 
devices, and that also forces the fixed world to move towards IPv6 networks and 
devices. That means that we have to address all the new requirements (e.g. how to 
identify the mobile devices) coming from the All-IP world. 

The distributed service platform introduced in this paper is an attempt to identify 
the problems and possible ideas that can be used later on while designing similar 
horizontal service platform solutions for commercial use. It is assumed that all 
network elements are capable of communicating over IP (either IPv4 or IPv6), and 
because we have chosen CORBA [3] as the control mechanism, this also creates 
requirements for the Media Client, which handles the actual service processing. The 
next section describes the distributed architecture briefly by introducing all the 
important entities involved in the service control process. The third section discusses 
user and device management, concentrating on the problems and the key issues in 
solving them. Section 4 covers additional design issues closely related to user and 
device management, and Section 5 concludes the paper. 



2 Distributed Architecture 

The distributed stream service architecture is largely based on the earlier work on 
stream services done in [4] and [5]. The current scenario introduces four network 
elements that form the basis of the stream service architecture: WWW server, Access 
Server, Media Server and Media Client. The WWW server is responsible mainly for 
providing the user interface for ordering the services, while the other network 
elements are involved in the actual stream control and negotiation. Figure 1 gives 
a view of the network elements involved in the architecture. 

The WWW server, Access Server and Media Server communicate via a CORBA 
backbone that can be isolated from the public IP network, e.g. by using a VPN, SSL or 
some other secure mechanism. The Media Client part of the CORBA control is not 
directly connected to the CORBA backbone, mainly because the user should not be 
part of the operator's control network and secondly because the control information 
sent to/from the Media Client is non-critical and not too confidential. The following 
subsections give more detailed information on the network elements and their 
functions. 



2.1 WWW Server Objects 

The main functionality of the WWW server is to provide the interface towards the 
users accessing the stream services. Those users might have different capabilities in 
terms of access network bandwidth, display size, etc., and the WWW server must be 
able to fulfill those different requirements when providing the user interface. In the 
current architecture we have considered the WWW server to be accessed only via 
HTTP. This comes mainly from the fact that the bit rates of the mobile 
networks (GPRS, UMTS, Bluetooth, WLAN) and the display sizes of the mobiles (PDAs, 
Communicators) are gradually increasing, and the WAP protocol will become 
impractical over time. However, it is possible to add a WAP gateway between the 
user and the WWW server if there is a need for content conversion for WAP-enabled 
devices. 

The WWW server provides a personalized view on the services available for the 
user and it supports different user roles and profiles for payments and browsing. It is 
also responsible for communicating and forwarding the selected service information 
to the Access Server, which is the main control element in the service system. 



2.2 Access Server Objects 

The Access Server is the most important element in our service system and it consists 
of several functional entities, which can be distributed to several computers to balance 
the load. The functional entities are 

■ Access Manager, 

■ Customer Manager, 

■ Device Manager, 

■ Order Manager, 

■ Service and Content Manager, 

■ Billing Manager, 

■ User, device, service and billing databases, and 

■ Naming Service. 

The Access Manager is responsible for user and device authentication and 
granting the access rights. The WWW server forwards the authentication parameters 
from the user to the Access Manager and the Access Manager checks the user 
properties from the database. If the user is authenticated, the Access Manager creates 
a Customer Manager object to handle the user specific parameters and settings. 
Similarly, for valid devices the Access Manager creates a Device Manager to handle 
the device specific functions. 

The Customer Manager takes care of the user roles and profiles and it is also 
responsible for user authorization. The user can modify his/her profile and change the 
role (e.g. from anonymous to identified) through the functions provided by the 
Customer Manager. The WWW server also uses directly some of the functions of the 
Customer Manager to modify the view of the web pages. The Customer Manager also 
creates Order Manager objects based on user selections to take care of the actual 
service. 

The Device Manager is quite similar to the Customer Manager. The biggest 
difference is that it handles the roles and registrations of the user devices without any 
user interaction. The devices can inform the service system whether they are online or 
not, and what kind of network and stream parameters they are willing to accept. 

The Order Manager controls the actual stream service transactions based on the 
information it gets from the user and the Customer Manager. When the user wants to 
make an order, the Order Manager takes care that the device and user parameters do 
not conflict. After the order parameters are selected and the user has confirmed the 
service, the Order Manager contacts an appropriate Media Server and forwards the 
service request to it. The Order Manager is also responsible for communicating with 
the Billing Manager and initiating the billing transaction. 

The Service and Content Manager handles the registration of media information in 
the databases. The Media Servers register their media content when they are started 
for the first time or when new content is stored. Many of the parameters are stored only 
in the Media Servers, but the parameters that affect the user decision or carry 
useful information for the user are forwarded to the Service and Content Manager. 

The Billing Manager is also part of the Access Server and it collects charging 
information throughout the service transaction. Both the Order Manager and the 
Media Server inform the billing system about the status of the stream service. Based 
on that information the Billing Manager can charge the user correctly. The user 
(Media Client) is not involved in the billing, because in the current architecture 
the Media Server is aware of the status of the stream service. 

The rest of the Access Server is composed of several databases and the naming service 
functionality. The databases contain information about the service system users, 
identified devices, services and billing. The databases are queried through a number 
of database adapters. The naming service provides name-to-IOR (Interoperable Object 
Reference) mapping, and all the objects within the service system, including the Media 
Client, must register with it. Figure 2 shows the relationships between the different 
Access Server objects. 
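
The name-to-IOR mapping can be pictured with a toy in-memory registry. This is not a CORBA Naming Service client: it is a plain dictionary wrapper whose method names only loosely mirror the usual bind/resolve vocabulary, and the object names and IOR strings used with it are placeholders.

```python
class NamingRegistry:
    """Toy in-memory stand-in for a naming service: maps names to IOR strings."""
    def __init__(self):
        self._bindings = {}

    def bind(self, name, ior):
        """Register a new name; refuse to overwrite an existing binding."""
        if name in self._bindings:
            raise KeyError(f"name already bound: {name}")
        self._bindings[name] = ior

    def rebind(self, name, ior):
        """Register a name, replacing any previous binding (e.g. after a restart)."""
        self._bindings[name] = ior

    def resolve(self, name):
        """Look up the IOR registered under a name; raises KeyError if unknown."""
        return self._bindings[name]

    def unbind(self, name):
        del self._bindings[name]

registry = NamingRegistry()
registry.bind("MediaClient/my-tv", "IOR:placeholder-1")   # placeholder IOR string
```

Any object that wants to reach the Media Client would first resolve its name and then use the returned reference, which is why every object, including the Media Client, has to register.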




Fig. 2. Access Server objects and their relationships 



2.3 Media Server Objects 

The reason for separating the Media Server from the Access Server comes from the idea 
that the media information can be made available through one well-known place, the 
Access Server, while the actual (sometimes high-bandwidth) media data can be stored in 
several Media Servers. The Access Server can select the Media Server that is located 
near the end-user, thus limiting the bandwidth usage. The Media 
Server and also the Media Client are based to some extent on the scheme 
introduced in the A/V Stream specification [6]. The A/V Stream specification gives us 
the building blocks that can handle stream control between two devices. 

The intelligence is in the Stream Controller, which is responsible for setting up and 
configuring the stream and the network connection between two Multimedia Devices 
(Media Server and Media Client). Parameter negotiation is considered part of the 
stream and network connection configuration. In addition to the A/V Stream part, the 
Media Server contains the Media Server control object, which is responsible for 
contacting the Media Client and requesting permission to connect to the Multimedia 
Device of the Media Client. It also handles billing and the forwarding of the stream and 
network parameters to the Stream Controller. The control part of the Media Server 
is also responsible for providing the Stream Controller’s IOR to any entity that 
may want to control the stream remotely (play, pause, rewind, etc.). 



2.4 Media Client Objects 

In our scenarios we have defined the Media Client as an entity that is able to 
process streaming audio and video and has decent network connection 
characteristics. It can be a future TV, an ordinary PC or some other terminal with 
adequate multimedia capabilities. Before the user can order anything to the Media 
Client device, the device must register its identity with the Access Server. This 
registration can include device-specific information such as network and stream 
capabilities, but it is enough to register just the IOR of the Media Client and the 
activity status (ON/OFF). The user can be associated with one or several Media 
Clients and vice versa. For that purpose the registration can include AccessID(s) for 
that specific Media Client. 




Fig. 3. Media Server and Media Client architecture 



In addition to the registration and other control functionalities, the Media Client 
includes also A/V Stream specific objects in order to be able to negotiate and 
configure the network and stream parameters with the Media Server. Figure 3 depicts 
a more detailed version of the Media Server and Media Client objects and 
connections. The stream model is based on [6]. 



2.5 Example Scenario 

Before we describe user and device management in this kind of system, Figure 4 
shows a scenario in which all the defined network elements are involved in the control 
of the stream service ordered by the user. 

A user who wants to order something to a Media Client contacts the WWW 
server, authenticates himself and selects a stream service. The WWW server provides 
a personalized view for the user based on the user profile and role throughout the 
transaction. If the user has associated some devices with himself, the Access Server 
provides that information to the user through the WWW server and the user is able to 
select e.g. “my TV” as the desired Media Client. When the service details are agreed, 
the Access Server forwards the service information to the Media Server, which 
initiates the actual service after it has negotiated the stream and network parameters 
with the Media Client. 




Fig. 4. Example scenario 



3 User and Device Management 

In order to make the service system operate as smoothly as possible, we should be 
able to connect the users, devices and services in such a way that the user doesn’t 
have to care about registrations or addressing and, moreover, the services should be 
associated in some way with the device capabilities. This section deals with user and 
device management problems and proposes solutions for them. 



3.1 Connecting the User, Device and Service 

The ease of service usage depends on many important things related to connecting 
the user, device and service smoothly to each other. One of the principal questions in 
this kind of service system, where the purpose is to order multimedia services with 
device A and to deliver the service to device B, is how we take into account the 
properties of the destination device (B, Media Client) while we are browsing and 
selecting the service. There are at least two different possibilities: 1) we choose a 
service from a broad selection of services and then point out the destination device 
to which this service should be sent, or 2) we select the destination device first and 
then choose the service from the list of services that are suitable for that specific 
device. Table 1 describes the advantages and disadvantages of these two 
ordering models from the viewpoint of both user and service provider. 

Table 1. Comparison of the ordering models 

Ordering Model      Advantages                       Disadvantages 

Service => Device   - better selection of services   - device does not necessarily 
                                                       support the service 
                    - clearer classification model   - cumbersome way to choose 
                                                       services 
                                                     - possible user dissatisfaction 

Device => Service   - device is capable of           - some of the available services 
                      processing the services          are not visible to the user with 
                                                       limited capabilities 
                    - better customer satisfaction 
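
The two ordering models amount to filtering in opposite directions over the same compatibility check. The sketch below assumes a deliberately small capability model (MIME type and maximum bandwidth only); the `Device` and `Service` records and all function names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    mime_types: set            # media types the device can process
    max_bandwidth_kbps: int

@dataclass
class Service:
    title: str
    mime_type: str
    bandwidth_kbps: int

def supports(device, service):
    """True if the device can process the service (simplified capability check)."""
    return (service.mime_type in device.mime_types
            and service.bandwidth_kbps <= device.max_bandwidth_kbps)

def services_for(device, catalog):
    """Device => Service: list only what the chosen device can process."""
    return [s for s in catalog if supports(device, s)]

def devices_for(service, devices):
    """Service => Device: after picking a service, list the devices able to receive it."""
    return [d for d in devices if supports(d, service)]
```

In the Service => Device model `devices_for` may come back empty, which is the "device does not necessarily support the service" drawback of Table 1; in the Device => Service model the user never sees the services that `services_for` filters out.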



The users should be able to easily identify the destination device (Media Client), 
which is used to process the media service, and therefore the service platform 
databases should contain information about the user-associated devices, their status and 
capabilities. Generally, one user can be associated with several devices and, on the 
other hand, a device can be associated with multiple users. A device must register 
itself with the Access Server (device database) before we can order anything to it. This 
can happen automatically in such a way that the Media Client control object contacts 
the Access Manager and authenticates itself. After the authentication the Access 
Manager creates a Device Manager object to handle the actual device capability 
registration. Figure 5 shows the Media Client registration scenario. 

Before the registration can be successfully handled, the Media Client must be 
connected to some user (owner) who has a valid account in the Access Server 
system. So a device is registered to one user, but several users can be associated 
with it. All devices in the device database have a unique deviceID, and associated with 
that deviceID there is information about the device capabilities such as 

■ Supported media types (MIMEs), 

■ Min/max bandwidth for audio and video, 

■ Min/max video framerate, 

■ Min/max video resolution, 

■ Supported audio and video codecs, 






■ Control method for device (SIP, CORBA), 

■ Device information (name, type, model, etc.), 

■ Device address (SIP-URL, IPv6, IPv4, IOR), and

■ Device activity. 




Fig. 5. Media Client registration with the Access Server system 

This set of information is delivered to the Access Server when the Media Client
registers. The information consists mainly of the stream and network (audio/video)
parameters, but there is also information about control methods and device activity.
The device activity parameter tells the Access Server whether the Media Client is
ready to take stream requests.
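
As an illustration only, the registered capability data and the Device => Service ordering model could be modeled as in the following Python sketch. It is not the implementation described in this paper: the class names, field names and the reduced set of capabilities (one bandwidth bound instead of the full min/max ranges) are assumptions made to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceRecord:
    """Hypothetical device-database entry; field names are illustrative."""
    device_id: str
    owner: str                 # the one user the device is registered to
    media_types: set           # supported MIME types
    max_video_kbps: int        # upper bandwidth bound for video
    control_method: str        # e.g. "SIP" or "CORBA"
    address: str               # e.g. an IPv6 address
    active: bool = True        # is the device ready to take stream requests?
    associated_users: set = field(default_factory=set)


@dataclass
class Service:
    """Hypothetical service description with just two stream parameters."""
    name: str
    media_type: str
    video_kbps: int


def suitable_services(device, services):
    """Device => Service ordering: keep only what the device can process."""
    if not device.active:
        return []
    return [s for s in services
            if s.media_type in device.media_types
            and s.video_kbps <= device.max_video_kbps]
```

With such a filter the user only sees services that the chosen device can process, which corresponds to the trade-off listed for the Device => Service model in Table 1.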

The user database contains information about user characteristics, profiles, roles
and device associations. A user can be associated with several devices and can name
these devices quite freely; however, the names must be unique within that user's
set of devices. The relationships between users and devices are clarified in Figure 6.



[Figure: the device database and the user database]

Fig. 6. Relationships between users and devices

When the required information about users and devices is in the databases, the
users can see the registered and associated devices (Media Clients) while making the
service order. They can therefore easily select the destination device without
knowing the exact IP addresses, port numbers and so on. The other possibility is to
select the destination device among the non-associated devices; in that case the user
has to know the address of the device.



3.2 User and Device Addressing 



Before the users can identify the devices unambiguously, the user and device
addressing must be unambiguous and location independent. IP addresses, especially
IPv6 addresses, are complex and hard to remember. The same goes for DNS names,
which do not always have any connection to the device identifier. Because all
service system users must be registered with the Access Server and every device is
associated with at least one user (a device is registered to one user), the user and
device addressing is handled in the following way:



ACS address = devicename "@" username
devicename  = *( unreserved | escaped | ... | "+" | "$" | ... )
username    = *( unreserved | escaped | ... | "+" | "$" | ... )
reserved    = ... | "+" | ...



The syntax is described using Augmented Backus-Naur Form, following the style
and rules shown in RFC 2543 [7]. In this syntax the username in a way takes the
form of a domain name, and the devices belong to that user (domain). Here are a
few examples of valid ACS (Access Server) addresses:



pc.sonera@rami.lehtonen
PC106.tampere@sonera



The first address could point to a computer in Sonera's office that is used by Rami
Lehtonen. The association is between the device (some PC) and the user. The second
address could point to the same PC; this time, however, the association is between
the device and Sonera (as a community). By identifying just the user or community,
we could get information on the devices associated with that user or community.
The ACS addresses are valid only within the service system.

In any case the addressing will be mapped to IP-level addressing, but with this
system-specific addressing we can hide those details from the users. As for IP-level
addressing, we handle the Media Clients as servers, so they have to be addressable
from the public IP network. The shortage of public IPv4 addresses therefore also
places requirements on the IP-level infrastructure. This kind of service system
requires an IPv6 network to be fully functional, mainly because the Media Server
must be able to connect to the Media Client, and in that case Network Address
Translator (NAT) [8] systems do not help.

A change in the device's IP address can be handled in the registration phase by
writing the new IP address to the device database. If the device's IP address changes
while the device is online, it should renew its registration with the service system.
The ACS address itself remains the same even though the device's IP address may
change over time. In this respect the system works better than DNS systems
(excluding Dynamic DNS). Unlike IP/DNS addressing, this addressing model is not
location dependent and hierarchical, and thus it enables more practical use of
addressing and supports mobility of the devices.
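
The mapping from a stable ACS address to a changing IP address can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Access Server implementation: the `AcsRegistry` class and the `split_acs` helper are hypothetical names, the character-set rules of the grammar are not checked, and the IPv6 addresses come from the documentation prefix 2001:db8::/32.

```python
def split_acs(acs_address):
    """Split 'devicename@username' at the single '@' separator."""
    devicename, sep, username = acs_address.partition("@")
    if not sep or not devicename or not username or "@" in username:
        raise ValueError(f"not an ACS address: {acs_address!r}")
    return devicename, username


class AcsRegistry:
    """Toy stand-in for the address mapping kept in the device database."""

    def __init__(self):
        self._ip_by_acs = {}

    def register(self, acs_address, ip_address):
        # Registration and renewal are the same operation here: the newest
        # IP address simply replaces the old one for that ACS address.
        self._ip_by_acs[split_acs(acs_address)] = ip_address

    def resolve(self, acs_address):
        # Raises KeyError if the device has never registered.
        return self._ip_by_acs[split_acs(acs_address)]


registry = AcsRegistry()
registry.register("pc.sonera@rami.lehtonen", "2001:db8::10")
registry.register("pc.sonera@rami.lehtonen", "2001:db8::99")  # renewal
print(registry.resolve("pc.sonera@rami.lehtonen"))
```

The user keeps ordering services to the same ACS address; only the value stored behind it changes when the device renews its registration.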



4 Design Issues 

This section focuses on a few special design aspects of the service system. First we
compare the functionality and features of the distributed model with the service
system that we designed and implemented earlier. Then we look at the security
aspects of the CORBA control connections, such as the secure CORBA backbone,
and explain how to guard the Media Client against different types of misuse.



4.1 Comparison to the Earlier Model 

The previous version of the service system, which allowed us to send control
information to the Media Client via the Access Server, is described in [4] and [5].
The earlier approach relied heavily on the Service Control Protocol (SCP), which
handled the parameter forwarding and negotiation between the Access Server and
the Media Client. In the distributed model the parameters pass through the Media
Server before they reach the Media Client, so the Access Server no longer has a
direct control connection with the Media Client (exception: registration). The reason
why we had to forward the stream and network parameters directly from the Access
Server to the Media Client was that we worked with commercial or public domain
Media Servers, which could not handle our control mechanism. In the distributed
system described here we can examine the effects of the stream parameters in more
detail, and we can also examine billing scenarios, because we now have access to
the actual service delivery. By moving the parameter forwarding and negotiation
from the Access Server to the Media Server, we also ease the load on the Access
Server and thus obtain a more scalable architecture. As for billing, the Media Server
can now inform the Billing Manager about the service properties that can be used in
the billing process. If the stream delivery is interrupted, we can signal the Billing
Manager, which can then take proper action to correct the charge. Table 2
summarizes the main differences between the two systems.

Table 2. Comparison between two different models for stream management

Old Model                                 New Model

- Centralized model (Access Server)       - Distributed model (Access Server)
- Own control protocol (SCP)              - CORBA based control (A/V Streams)
- Control connection from Access Server   - Control connection from Media
  to Media Client                           Server to Media Client
- Stream connection initiated by Media    - Stream connection initiated by Media
  Client                                    Server
- Commercial Media Servers                - Media Servers actively involved in
                                            control
- No billing management                   - Billing management possible



4.2 Security Aspects 

The Access Server holds a lot of valuable information about the users, devices and
services, and the users must be able to trust that their information is safe within the
Access Server. To ensure this, the Access Server must be placed behind a firewall
and thereby separated from the public IP network. The Access Server must,
however, communicate with many Media Servers, and the Media Clients must be
able to register themselves with the Access Server. The firewalls can quite easily be
restricted to allow only IIOP (Internet Inter-ORB Protocol), so that the Media Clients
are able to connect to the Access Server. Media Clients also have to authenticate
themselves to the Access Manager; after a successful authentication, the Access
Manager creates a Device Manager object to handle the Media Client registration.
The Media Client can access only those two objects. The control traffic between the
Media Client and the Access Server must always be encrypted, using e.g. SSL,
IIOP-level encryption or IPsec. Similar encryption can also be used between the
Access Server and the Media Servers, but we can also create a VPN (Virtual Private
Network) type of connection between those two entities, because they are likely to
be quite static network elements. The VPN connection can be secured e.g. by using
IPsec. The secure CORBA backbone thus consists of those VPN connections and of
the connections between the Access Server and the WWW server, which are always
isolated from the public IP network.

If we look at the security aspects from the user's point of view, we are concerned
mainly with the security of the access connection that is used to connect the user to
the WWW server. That is the most critical interface from the user's perspective,
because the billing is based on the information exchanged over that interface.
Normally we can have access network authentication for the users (e.g. GSM
authentication), and above that we authenticate users at the application level. The
application-level connection should be encrypted; we can use e.g. SSL on top of the
TCP/IP layer to secure the information exchange.



4.3 Restricting Media Client Usage 

The other part of the client security concerns the Media Client connections. Because
the Media Client acts as a server and waits for Media Server requests, we must be
sure that the requesting side is a valid Media Server and that the order originated
from a trusted and allowed user. Therefore we have defined a parameter called
AccessID, which can be considered a password that must be included in the
StreamRequest message (see Figure 3). The user who makes the order through the
service system must give the AccessID, and the Access Server forwards this
parameter, along with the stream and network parameters, to the Media Server
responsible for delivering the media content. The Media Client can accept a variable
number of AccessIDs, and based on those AccessIDs it can differentiate between
services and users. An AccessID can be individual, it can be common to some group,
or it can be a device-specific identifier; this depends on the Media Client
configuration. If the Media Client wants to keep unnecessary control traffic from
reaching it, it can inform the Access Server about the valid AccessIDs, and
unauthorized users can then be blocked at an early stage.

The control traffic between the Media Server and the Media Client does not
contain any sensitive information except the AccessID. If we avoid sending the
AccessID over this control connection, the control connection can be implemented
without any encryption. One way of doing this is to calculate a “checksum” from
the AccessID by using some one-way function. This checksum is sent to the Media
Client, which uses the same function to calculate another checksum from the
AccessIDs it knows. If the checksums match, we can be sure that the AccessIDs also
match. On its own, however, this does not protect against replay attacks; for that we
also need to include some other parameters (like a timestamp, a nonce, the source IP
address, etc.) in the checksum calculation. The last option is to encrypt the control
connection between the Media Server and the Media Client as well.
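
The checksum idea with replay protection can be sketched in Python as follows. This is only an illustration under stated assumptions: HMAC-SHA-256 is chosen here as the one-way function (the text above does not prescribe one), the 30-second freshness window is arbitrary, and `make_proof` and `MediaClientVerifier` are hypothetical names rather than parts of the system described in this paper.

```python
import hashlib
import hmac
import secrets
import time

MAX_AGE_SECONDS = 30  # assumed freshness window for a request


def _checksum(access_id, nonce, timestamp):
    # One-way function over nonce and timestamp, keyed with the AccessID,
    # so the AccessID itself never travels over the control connection.
    message = f"{nonce}:{timestamp}".encode()
    return hmac.new(access_id.encode(), message, hashlib.sha256).hexdigest()


def make_proof(access_id, now=None):
    """Media Server side: build (nonce, timestamp, checksum) for a request."""
    nonce = secrets.token_hex(16)
    timestamp = int(time.time() if now is None else now)
    return nonce, timestamp, _checksum(access_id, nonce, timestamp)


class MediaClientVerifier:
    """Media Client side: recompute the checksum for each known AccessID."""

    def __init__(self, valid_access_ids):
        self._valid = set(valid_access_ids)
        self._seen_nonces = set()

    def accept(self, nonce, timestamp, checksum, now=None):
        now = time.time() if now is None else now
        if abs(now - timestamp) > MAX_AGE_SECONDS or nonce in self._seen_nonces:
            return False  # stale or replayed request
        for access_id in self._valid:
            expected = _checksum(access_id, nonce, timestamp)
            if hmac.compare_digest(expected, checksum):
                self._seen_nonces.add(nonce)
                return True
        return False
```

A captured request cannot simply be resent, because its nonce is remembered and its timestamp expires. A real deployment would also have to discard old nonces once they fall outside the freshness window; the sketch keeps them forever.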



5 Conclusions 

This paper describes a system that can be used, for instance, to order stream-based
personalized services for a number of terminals, wired or wireless, with different
capabilities. Building this kind of system is not a trivial task, and various issues like
user and device management must be carefully examined. By introducing CORBA
for control purposes we can model the system at a higher level of abstraction and
combine various elements, like billing and service management, with the core
system. Future networks and terminals will be IP-based, and the terminals (mobile
and fixed) are also moving towards open platforms. Together they open up many
new possibilities for network operators and service providers. The key elements in
those systems include a flexible control mechanism, an efficient and scalable
architecture and well-designed management functionality.

Abstract. The IST project AQUILA aims to develop a flexible, extendable and
scalable Quality of Service architecture for the existing Internet. The core network
will be an enhanced DiffServ network providing several dynamically manageable
traffic classes with specific QoS parameters, per hop behaviors, and other
“guidelines” that realize different network services. A new logical layer, the
Resource Control Layer, has been defined on top of the DiffServ network. The task
of this layer is to control the underlying network in order to provide QoS features to
the customers of the network. The Resource Control Agent, the Admission Control
Agent and the End-user Application Toolkit (EAT) are the logical components of
this architecture. Legacy as well as new QoS-aware applications running on the
hosts will use the EAT middleware to benefit from the QoS capabilities of the
AQUILA architecture. The AQUILA project is split into two phases. It has now
completed the definition and implementation of the first phase and has run the first
trial. The architectural definition and the trial results are discussed in this paper.



1 Introduction 

In order to satisfy the huge commercial demand for Quality of Service (QoS)
solutions over IP networks, the project AQUILA defines, evaluates, and implements
an enhanced architecture for Quality of Service. The Differentiated Services
(DiffServ) approach to QoS provisioning in IP networks is used as the basis for the
specification of this architecture. A goal of the project is to verify the achieved
technical solutions by testbed experiments and by trials involving end-users. The
trials include QoS-demanding on-line services like multimedia services [1].

Hereafter, the overall objectives that were established at the beginning of the pro- 
ject will be presented, followed by a discussion on the main innovation expected from 
the project. Key objectives of the project are: 

• To enable dynamic end-to-end QoS provisioning in IP networks for QoS-sensitive
applications, e.g. Internet telephony, premium web surfing and video streaming.
Static resource assignments will be considered as well as dynamic resource
control.

• To design a QoS architecture including an extra layer for resource control for
scalable QoS control and to facilitate migration from existing networks. The
DiffServ architecture for IP networks will be enhanced by introducing dynamic
resource and admission control. Main features are:

  - The architecture will be usable by any relevant kind of IP application.

  - The architecture will be cost-effective, scalable and backward compatible for
    the provisioning of QoS in IP networks, covering both inter- and intra-domain
    QoS.

• To implement prototypes of the QoS architecture as well as QoS-based end-user
services and tools in order to validate the technical approach of the solution
design. This includes:

  - Development of a novel resource control layer extending Bandwidth Broker
    functionality.

  - Provision of an End-user Application Toolkit (EAT) in order to support the
    establishment of QoS by end-users and applications.

  - Creation of tools for QoS provisioning, monitoring and management in order
    to help operators control QoS IP networks.

  - Development of a distributed measurement infrastructure for end-to-end
    QoS parameters.

• To validate the QoS architecture in a field trial involving a commercial online
service, and to prove the concepts for larger scale networks, higher network load
and different kinds of end-user services within a distributed testlab and by
simulations.

Under the following headlines, the specific innovations of the project are explained in 
more detail. The strength of the project is to address all the individual issues under a 
unified perspective, while achieving specific innovations in each single aspect. 

• Scalable and flexible Resource Control Layer 

A major innovation of the project is a new layer on top of the DiffServ network,
called the Resource Control Layer (RCL), which is used for controlling and managing
the network resources. This layer can be seen as a distributed Bandwidth Broker in
the DiffServ architecture. A node in the Resource Control Layer is called a Resource
Control Agent (RCA). The Resource Control Layer will include both the
intra-domain aspects (i.e. the dialogue of the RCAs with end-users and with
networking devices like edge devices and core routers) and the inter-domain aspects
(i.e. the dialogue among RCAs belonging to different domains). In comparison to
the current bandwidth broker proposals and prototypes, the project will introduce:

- a distributed architecture as a base for scalability and reliability,

- active mechanisms in the Resource Control Layer in order to adapt the network
  configuration to current traffic loads,

- measurement-based admission control in Resource Control Agents,

- a highly reliable Resource Control Layer design,

- integration of defined DiffServ classes and mechanisms (Per Hop Behaviors,
  PHB) into the layered QoS architecture. In particular, suitable
  control/management interfaces between IP networking devices and the
  Resource Control Layer will be defined.

• End-user Application Toolkit 

In order to provide QoS for end-user applications, the AQUILA project comprises 
the development of an End-user Application Toolkit (EAT). This toolkit can be used 
by application developers and provides reusable and generic components for both 
client and server sides of applications. 

The EAT enables the construction of QoS-aware applications as well as the
migration of legacy applications to QoS-awareness, and ensures compatibility with
various methods and protocols for communicating QoS requests between end-user
applications and the Resource Control Layer. It also provides scalable mechanisms
for applications that produce a large number of short-lived sessions with unclear
requirements or for which dedicated reservations are inappropriate, and it offers
control mechanisms for the support of bi-directional services.

• QoS Management Tool 

In order to deploy and operate QoS mechanisms in large scale IP networks, the
project develops a QoS Management Tool (QMTool) that

- allows high-level management of user-related QoS policies,
- enables the interaction among different ISPs and administrative domains,
- foresees the interworking of different mechanisms (i.e. IntServ / DiffServ /
  MPLS / traffic engineering).

The current lack of tools to handle these complex issues hinders deployment of 
QoS into operational IP based networks. The QMTool simplifies the task of a service 
provider in operating QoS aware networks. 

• Traffic and Admission control. Traffic engineering

The innovative architectural concepts require novel algorithms for traffic control, 
admission control and traffic engineering in order to optimize the network perform- 
ance and achieve efficient usage of the network resources. A goal of the project is to 
define a clear reference framework to operate with the different facets of traffic han- 
dling in a consistent way. Traffic handling is composed of: 

- Provisioning and Traffic Engineering procedures, operating at an
  hours/days/weeks time scale

- Admission Control functions, operating at a seconds/minutes time scale

- Packet-level traffic control, which includes all the mechanisms for classifying,
  marking, scheduling and policing the IP packets (it operates at a milliseconds
  time scale).

The AQUILA project has taken a two-phase approach to the definition of the
architecture and to its development and trials. The first phase focused on
single-domain QoS; no focus was put on the QMTool, and the 1st trial was a lab
trial. The first phase is now almost complete and most results from the 1st trial are
available. Sections 2, 3 and 4 give the definition of the AQUILA first phase
architecture. Section 5 discusses the trial results, while Section 6 gives some
information on the ongoing work on the definition of the second phase architecture.



2 AQUILA QoS Architecture 

The current Internet architecture is not designed to support QoS, and there exist
different approaches for providing QoS over IP-based networks. Due to the different
underlying mechanisms of these approaches and the complexity of end-to-end QoS,
there is currently no solution suitable for global operation. Integrated Services,
Differentiated Services, Multi Protocol Label Switching, QoS Routing and
bandwidth over-provisioning are among the most important techniques for QoS in
IP networks. For a
comprehensive discussion on the IP QoS, seelf^. 

Most of the above-mentioned technical solutions for bringing QoS into IP networks
are still under discussion. Some of them are divergent, while some are
complementary. No integrated scalable solutions are available right now.
Furthermore, management and interoperability aspects of the mentioned
approaches are currently treated poorly. The project takes the DiffServ architecture
as the most promising starting point for its work and develops extensions of this
architecture in order to avoid the statically fixed pre-allocation of resources to users.
Dynamic adaptation of resource allocation to user requests is enabled in a way that
keeps the overall architecture scalable to very large networks. As an example of an
alternative approach, the RSVP protocol defined in the Integrated Services
architecture should be present in the access network, where the load is small and
scalability is not an important issue.



2.1 The AQUILA Resource Control Layer 

The Resource Control Layer (RCL) is an overlay network on top of the DiffServ
core network. The RCL has three main tasks, which are assigned to different logical
entities:

- To monitor, control and distribute the resources in the network. This task is
  assigned to the Resource Control Agent (RCA).

- To control access to the network by performing policy control and admission
  control. This task is assigned to Admission Control Agents (ACA). Each edge
  router or border router is controlled by an ACA. As each access request
  necessarily means usage of resources, the RCA may be directly or indirectly
  involved in handling admission requests.

- To offer an interface of this QoS infrastructure to applications. This task is
  assigned to the End-user Application Toolkit (EAT). From the network point of
  view the EAT acts as an RCL front-end. From the user point of view, the EAT
  provides a QoS portal.






The entities defined above are associated with network elements within the
underlying domain as shown in Fig. 1. An EAT instance can be responsible for a
single host as well as for a set of hosts. The latter might be the case when not a
single host but a whole sub-network is connected to an edge router. The Resource
Control Layer assumes an underlying DiffServ network. The DiffServ code points
(DSCP) and the PHBs of this network are assumed to be predefined by management;
they are not under control of the RCL for the 1st trial of the project AQUILA. For
each traffic class, however, a specific amount of bandwidth is available on each link
of each edge router, border router or core router. Bandwidth is therefore the main
resource handled by the RCL.



[Figure: RCL entities and the network entities they are mapped to]

Fig. 1: Mapping of RCL entities to the underlying network entities

In the first phase of the project (1st trial), the RCL implements dynamic admission
control by distributing the pre-configured, static resources of the core network
among edge routers and border routers. In the 2nd trial, dynamic reconfiguration of
core network resources is also taken into consideration.

In the following, a single DiffServ domain is assumed. Inter-domain aspects are 
not covered as they will be handled in the second phase of the project. 



2.2 Resource Control Agent 

The support of QoS, as proposed above, leads to the introduction of a new logical
layer, the Resource Control Layer (see Fig. 2). The Resource Control Layer provides
an abstraction of the underlying layers. A node in the Resource Control Layer is
called a Resource Control Agent (RCA) and represents a portion of the IP network
that uses the same QoS control mechanisms internally. An RCA is a generalization
of the concept of the Bandwidth Broker in the DiffServ architecture. RCAs are logical
units that can run on several physical configurations, e.g. one server per RCA or
several RCAs co-located on one server. The QoS control mechanisms used in the
underlying network vary in nature: in some parts the routers may not even support
DiffServ (which means that there is only a trivial best-effort QoS control), while in
other parts they may be DiffServ capable. Moreover, some parts of the network may
allow dynamic reconfiguration of resources, e.g. by adding ATM connections, while
others may have a more or less fixed configuration, e.g. pure SDH or WDM
sub-networks. Another reason for introducing separate RCAs is that sub-networks
may be domains managed by different operators.




[Figure: ISP domain]

Fig. 2. AQUILA QoS Architecture: the Resource Control Layer



A Resource Control Agent is able to observe and in some sense to influence the ac- 
tual configuration in the network portion it represents. Configuration parameters may 
describe the fraction of a network connection devoted to a specific DiffServ traffic 
class or the existence of a virtual connection (in ATM networks) with a specified 
bandwidth. For this administration purpose, a mechanism is required to access IP 
routers and possibly other network elements. 

The RCAs employ distributed computation that adapts the network according to 
user requests for QoS. The RCAs always try to establish a situation where the net- 
work configuration is “slightly over-dimensioned” such that user requests can be 
immediately satisfied by just checking the admissibility of the user and recording the 
additional resource usage. As soon as some “watermarks” are reached, the RCAs start 
a dynamic reconfiguration process in order to avoid congestion. The adaptation algo- 
rithm of an RCA is the reason why it is called an “agent” (even an intelligent agent) 
since the RCA can act autonomously, in contrast to admission control which is often 
non-local. For the purposes of a first prototype, locally fixed agents (which
communicate over CORBA) seem to be sufficient. In principle, however, mobile
agents could also be used here, e.g. to move the “master” agent of an Internet
backbone provider to a different physical location when its current home gets
overloaded.

The RCA must also support flexible accounting schemes for QoS services, including
both end-user-ISP and ISP-ISP accounting. In addition, the RCA interacts with
network management, first for configuration and second for handling partial
network failures. For configuration and service creation, the RCA interacts with the
QMTool, which offers integrated and flexible service control.



2.3 Admission Control Agent 

A DiffServ network can only provide quality of service if it is accompanied by
admission control, which limits the amount of traffic in each DiffServ class.
Admission control checks whether the resources requested by a user are available in
the network and admits or rejects the request accordingly.

The AQUILA architecture uses local admission control located in the Admission
Control Agent (ACA), which is associated with the ingress and egress edge router or
border router. To enable the ACA to answer the admission control question without
interaction with a central instance, the RCA places objects representing some share
of the network resources near the ACA. Resources are assigned to these objects
proactively. For the ACA, these objects represent a “consumable ResourceShare”.

Admission control can be performed either at the ingress or at the egress or at both, 
depending on the reservation style. 

The ACA will just allocate and de-allocate resources from its associated consum- 
able ResourceShare. The ACA is not involved in the mechanisms used by the RCA to 
provide this resource share, to extend and to reduce it. 
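
The division of work between ACA and RCA can be illustrated with a small Python sketch. It assumes that bandwidth in kbit/s is the only resource and that there is one share per traffic class and direction; the `ResourceShare` class and its method names are hypothetical and are not taken from the AQUILA implementation.

```python
class ResourceShare:
    """Consumable bandwidth share for one traffic class at one ACA (sketch)."""

    def __init__(self, capacity_kbps):
        self.capacity_kbps = capacity_kbps  # set and resized by the RCA only
        self.used_kbps = 0

    def allocate(self, kbps):
        """Admit the request only if it fits into the local share."""
        if kbps <= 0 or self.used_kbps + kbps > self.capacity_kbps:
            return False
        self.used_kbps += kbps
        return True

    def release(self, kbps):
        """Give bandwidth back when a reservation ends."""
        self.used_kbps = max(0, self.used_kbps - kbps)


share = ResourceShare(capacity_kbps=1000)
print(share.allocate(600))  # fits into the share
print(share.allocate(500))  # would exceed the share, so it is rejected
```

The admission decision is purely local: the ACA never contacts a central instance, and only the RCA would change `capacity_kbps`.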

Resource distribution is performed on a per DiffServ class basis. In the 1st trial,
there is no dynamic reconfiguration of DiffServ classes, so the resources of each
class can be handled separately and independently of each other. This per-class
distribution, however, is not appropriate for edge devices that are connected to the
core network via low-bandwidth links. In this case, additional mechanisms apply.

Resources are handled separately for incoming traffic (ingress) and for outgoing 
traffic (egress). The following description of resource distribution applies to both. 

Resource distribution is performed by the RCA in a hierarchical manner using so- 
called Resource Pools. For this purpose it is assumed, that the DiffServ domain is 
structured into a backbone network, which interconnects several sub-areas. Each sub- 
area injects traffic only at a few points into the backbone network. This structuring 
may be repeated on several levels of hierarchy. 

When considering the resources in the backbone network, all traffic coming from 
or going to one sub-area can be handled together. So it is reasonable to assign a spe- 
cific amount of bandwidth (incoming and outgoing separately) to each sub-area. 
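
One possible reading of this hierarchy is a tree of pools in which each sub-area draws bandwidth from the pool above it. The Python sketch below is an assumption-laden illustration, not the AQUILA algorithm: the `ResourcePool` class, its fields and the rule that a pool asks its parent for exactly the missing amount are all hypothetical, and one such tree would be needed per traffic class and per direction (ingress and egress).

```python
class ResourcePool:
    """One node in an assumed tree of bandwidth pools for a traffic class."""

    def __init__(self, name, budget_kbps=0, parent=None):
        self.name = name
        self.budget_kbps = budget_kbps  # bandwidth this pool may hand out
        self.assigned_kbps = 0          # bandwidth already handed out
        self.parent = parent

    def free_kbps(self):
        return self.budget_kbps - self.assigned_kbps

    def grant(self, kbps):
        """Hand out kbps, asking the parent pool for any shortfall."""
        shortfall = kbps - self.free_kbps()
        if shortfall > 0:
            if self.parent is None or not self.parent.grant(shortfall):
                return False
            self.budget_kbps += shortfall
        self.assigned_kbps += kbps
        return True


backbone = ResourcePool("backbone", budget_kbps=10000)
area_a = ResourcePool("sub-area A", parent=backbone)
print(area_a.grant(4000), backbone.free_kbps())
```

Requests that a sub-area can satisfy from its own budget never reach the backbone pool, which is what keeps the scheme local.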



2.4 End-User Application Toolkit 

The End-user Application Toolkit (EAT) is an application that aims to give
end-user applications access to QoS features. The EAT is a middleware between the
end-user applications (for example a video conferencing tool or a video-on-demand
service) and the network infrastructure (for example the AQUILA network).

The tasks of the EAT are to allow legacy applications (QoS-aware and
non-QoS-aware) to benefit from QoS features (1st trial), and to allow the
implementation of QoS-aware EAT-based applications by making use of an API
(2nd trial).

In particular, the EAT

- enables the construction of QoS-aware applications as well as the migration of
  legacy applications to QoS-awareness,

- ensures compatibility with various methods and protocols for communicating
  QoS requests between end-user applications and the Resource Control Layer.
  Special schemes for integrating application level protocols (e.g. H.323, SIP) with
  QoS control are taken into consideration,

- provides scalable mechanisms for applications which produce a large number of
  short-lived sessions with unclear requirements (e.g. WWW) and where dedicated
  reservations are inappropriate,

- offers control mechanisms for the support of bi-directional services. This
  enhances the DiffServ approach, which supports only simple uni-directional flows.

The EAT is transparent for legacy applications, but will be mostly transparent for new 
EAT-based applications (using the API). 



3 Service Level Specifications, Network Services 
and Traffic Classes 

Two important aspects of QoS are QoS guarantees and QoS differentiation. 

In order to provide QoS differentiation, a limited set of Network Services have 
been defined in the AQUILA project, which represent the services sold by the pro- 
vider to its customers: Premium Constant Bit Rate (PCBR), Premium Variable Bit 
Rate (PVBR), Premium Multimedia (PMM), Premium Mission Critical (PMC) and 
Standard Best Effort (STD). Each Network Service is meant to support a class of 
applications with substantially similar requirements and characteristics. The Network 
Services are internally mapped by the operator into a set of Traffic Classes. The
Traffic Classes use DiffServ-based packet handling mechanisms and are defined in
terms of the queuing and scheduling mechanisms in the routers. The set of traffic
classes defined by AQUILA is discussed in more detail in the next section.

In order to provide QoS guarantees an ISP must somehow regulate the amount of 
traffic entering the network regarded as a limited set of resources. In the AQUILA 
approach this is accomplished by the distributed Resource Control Layer, whose 
architecture has been defined in the previous section. The RCL embeds different 
mechanisms to regulate the traffic at different time-scales - Initial Provisioning, Dy- 
namic Resource Pool and Admission Control. These mechanisms will be presented in 
the next section as well. As already mentioned, in the first phase of the AQUILA 
project the focus is on QoS in a single domain. 

An important aspect of QoS provisioning is the definition of the agreement
between the user and the provider about the scope of the QoS contract: which flows
should receive QoS, what their traffic characteristics are, and what QoS level is
expected. The specification of this agreement is commonly referred to as the “Service
Level Specification” (SLS). It has been recognized that a common standardized way to
express the semantics of the SLS could be useful in an open multi-provider
environment. Some work has been produced in this direction ([9], [10]) and related
discussion within the IETF is ongoing. If a “static” QoS provisioning approach is
envisaged, the agreement is negotiated “off-line” between the user and the provider,
possibly involving human intervention. A formal SLS can be useful to have a clear and
commonly understood picture of the service and of the required QoS. If a dynamic
approach is used, where the user application can automatically send QoS requests to
the network, the SLS should also be mapped into signaling information exchanged by
QoS-aware elements. This is the approach followed by the AQUILA project, where the
EATs send their reservation request messages to the ACA, specifying an SLS. The
semantic definition of the AQUILA SLS is described in [11]. The reservation request
messages, like all AQUILA control messages, are transported using CORBA.



4 Traffic Handling Approach 



This section focuses on the traffic handling mechanisms in AQUILA. Traffic han- 
dling is used here as a general term for a set of coordinated mechanisms that operate 
at different time scales: 

Traffic control refers to the mechanisms operating at milliseconds time scale 
like packet scheduling, policing, queue management. 

Admission control refers to the algorithms to decide about the acceptance of a 
new flow in the network, operating at the time scale of seconds to tens of min- 
utes. 

Resource Pools refers to the algorithm for short-term resource redistribution, 
to cope with local fluctuations in offered traffic, operating at the time scale of 
tens of minutes to hours. 

Provisioning refers to the algorithm for medium/long term resource allocation 
and redistribution, operating at the time scale of hours to days. 



The relationships among these logical components (Provisioning, Resource Pools,
Admission Control and Traffic Control) are described hereafter. A very high-level
view of the process that enables QoS in the AQUILA architecture is given in Fig. 3,
while Fig. 4 gives a simplified pictorial view of the relationships between the
different mechanisms.

The Provisioning phase is run off-line before the network operation, and gives the 
required input to the RCL elements as well as configuration values for setting the 
router parameters. The initial provisioning algorithm takes as an input global informa- 
tion about the topology, the routing (costs of links), the expected traffic distribution 
between Edge Routers for each Traffic Class (TCL), and any further constraints on 
the link bandwidth sharing between TCLs. It performs a sort of global computation
and produces as output:

the expected amount of traffic for each Traffic Class on each link, called
provisioned rate. This is used for the router configuration, i.e. to choose the
appropriate setting for the scheduling / queuing parameters (WFQ weights,
WRED thresholds) at each router interface.

the Admission Control Limits for each Traffic Class at each Edge Router, 
which are used by AC algorithms during the operation phase. 

Definition of the Resource Pools sets. 







The Traffic Control mechanisms define how the packets of the different classes are
handled by the Edge and Core Routers in the AQUILA network. They include traffic
conditioning (also referred to as policing), which is enforced at ingress ERs only,
and scheduling / queuing algorithms, implemented at every router interface.

The configuration of the scheduling / queuing mechanism is “static”, i.e. the rele- 
vant parameters are configured in the routers at start up. An off-line procedure com- 
putes these parameters starting from the provisioned rates produced by the Initial 
Provisioning algorithm. The per-flow traffic conditioning parameters at the ingress
edge, by contrast, are configured at run time according to the admitted requests.

The Admission Control procedure is intended to restrict traffic in order to avoid 
congestion. The AC procedure is operated on-line, but the AC reference limits, or AC 
Limits, are computed during the off-line initial provisioning phase and configured 
during the start-up phase. 

The assignment of AC Limits to each Edge Router for each TCL represents a re- 
source assignment to the relevant traffic aggregates. As the AC Limits are computed 
based on the expected offered traffic at each ER, some deviation can occur during the
operation phase between the actual offered traffic and the resource distribution
between ERs. The Resource Pools mechanism represents a way to dynamically change
the AC Limits to some extent, so as to track short-term fluctuations in traffic
requests. The mechanism is based on the concept of Resource Pools (RPs), which are
sets of Edge Routers that can exchange resources with each other. Such sets are
defined during the Initial Provisioning phase.
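
The exchange of resources inside one Resource Pool can be pictured with the following Python sketch. It is only an illustration under stated assumptions: the function name `borrow`, the dictionaries `limits` and `reserved`, and the policy of taking spare capacity from the peer with the most headroom first are all hypothetical and are not taken from the AQUILA specification. The only property borrowed from the text is that Edge Routers of the same pool shift AC Limit among themselves, so the total assigned to the pool stays constant.

```python
def borrow(limits, reserved, er, needed):
    """Try to raise the AC Limit of `er` by `needed` units.

    limits   -- dict: Edge Router name -> current AC Limit (one TCL, one pool)
    reserved -- dict: Edge Router name -> bandwidth already admitted
    The extra limit is taken from the unused headroom of the other pool
    members (hypothetical policy: largest headroom first).
    Returns True and updates `limits` in place, or False with no change.
    """
    spare = {p: limits[p] - reserved[p] for p in limits if p != er}
    if sum(spare.values()) < needed:
        return False  # the pool as a whole cannot cover the request
    remaining = needed
    for peer, headroom in sorted(spare.items(), key=lambda kv: -kv[1]):
        take = min(headroom, remaining)
        limits[peer] -= take
        limits[er] += take
        remaining -= take
        if remaining == 0:
            break
    return True
```

In this sketch a refused request leaves every limit untouched, which mirrors the idea that the pool can only track fluctuations "to some extent".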



Fig. 4. Initial Provisioning, Res. Pools, Admission Control and Traffic Control



4.1 Traffic Classes and Packet Level Traffic Control 



AQUILA has defined a set of five Traffic Classes. Each TCL is associated with a
different queue in the router output interface, and with a bandwidth portion on each
link. The queue dedicated to TCL-1 is served with strict priority over the others.
The other queues are served by a WFQ scheduler. Fig. 5 shows the inter-TCL scheduling
scheme.

TCL-5 is intended to support the Standard Service (STD), i.e. the traditional best- 
effort traffic. The traffic accessing the STD service is not delivered any QoS and is 
not regulated by any AC and/or policing function inside the ER. Nevertheless, a non- 
null amount of bandwidth will be guaranteed to this traffic on each link. 

TCL-1 and TCL-2 are intended to support non-reactive (open loop) traffic with
stringent QoS requirements. In particular, TCL-1 will be characterized by very high
QoS performance (very low delay and very low losses), accomplished by a conservative
AC scheme. In the AQUILA architecture TCL-1, which is somewhat similar to the EF PHB
defined in [12], will exclusively support the PCBR service. Typically, TCL-1 will be
used by flows originated by real-time streaming applications like VoIP. On the other
hand, TCL-2 will deliver a lower QoS level (low delay and low losses) to those
streaming applications with high emission rate variability and/or large packets.
TCL-2 will mainly support the PVBR network service. A “severe”, purely dropping
traffic conditioning at the ingress point is associated with TCLs 1 and 2, i.e. all
packets exceeding the declared profile are discarded. The traffic profile for TCL-1
is described in terms of a Single Token Bucket, limiting the flow peak rate. The
traffic profile for TCL-2 is described in terms of a Dual Token Bucket, controlling
both the peak and mean rate of the flow.
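
As an illustration of the dropping traffic conditioning described above, the following Python sketch polices a flow against one token bucket (TCL-1 style, peak rate only) or two token buckets (TCL-2 style, peak and mean rate). The class and function names, the units (bytes and seconds) and the choice to start with full buckets are assumptions made for this example; they are not taken from the AQUILA specification.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Token bucket with a fill rate in bytes/s and a depth in bytes."""
    rate: float
    depth: float
    tokens: float = 0.0
    last: float = 0.0

    def __post_init__(self):
        self.tokens = self.depth  # assumption: the bucket starts full

    def refill(self, now):
        # Add the tokens earned since the last arrival, capped at the depth.
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.rate)
        self.last = now


def police(buckets, arrivals):
    """Return one decision per (time, size) arrival: True = forward, False = drop.

    A packet is forwarded only if it conforms to every bucket. Tokens are
    consumed only for forwarded packets, so dropped packets do not use up
    the declared profile.
    """
    decisions = []
    for now, size in arrivals:
        for b in buckets:
            b.refill(now)
        ok = all(b.tokens >= size for b in buckets)
        if ok:
            for b in buckets:
                b.tokens -= size
        decisions.append(ok)
    return decisions
```

Passing a single bucket corresponds to the Single Token Bucket profile, while passing a peak-rate bucket together with a mean-rate bucket corresponds to the Dual Token Bucket profile.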



Fig. 5. Design of router output port

TCL-3 and TCL-4 are dedicated to reactive flows (TCP and TCP-like). In particular,
TCL-3 will support the PMM service and serve long-lived TCP connections (for long
file transfers) or other adaptive application flows (audio/video download, adaptive
video). A Single Token Bucket descriptor is used to declare the mean rate only.
Traffic conditioning at the ingress point is based on two-color marking:
out-of-profile packets are not discarded but simply marked as such with a different
DSCP value. At router interfaces, the TCL-3 queue uses a WRED management algorithm
with two different sets of parameters for in-profile and out-of-profile packets.
TCL-4 instead will support the PMC service and will receive non-greedy elastic flows,
typically short-lived TCP connections originated by some critical transaction
application (e.g. finance) or interactive games. A Dual Token Bucket descriptor is
used to declare both mean and peak rate. Traffic conditioning and queue management
are similar to TCL-3.



4.2 Admission Control 

At each Edge Router (ER), each TCL is assigned a bandwidth value, which is used to
limit the maximum amount of traffic that the ER can inject into the network for the
specific TCL. This value, referred to as the “AC Limit”, is used by the Admission
Control algorithms to decide about the acceptance of new flows. A different admission
control algorithm has been defined for each traffic class. The AC algorithms are
partly derived from the results developed in the context of ATM traffic control. The
TCL-1 class uses a peak rate allocation scheme. For the TCL-2 traffic class, the REM
(Rate Envelope Multiplexing) scheme is assumed for guaranteeing low packet delay
[13]. For TCL-3, each flow is characterized by the parameters of a single token
bucket algorithm, which correspond to the sustained bit rate (SBR) and the burst size
(BSS). Since the traffic flows submitted to the TCL-3 class are TCP-controlled, rough
QoS guarantees can be acceptable. The admission control procedure therefore only
checks that the sum of the sustained rates declared by the sources is less than the
AC Limit. For TCL-4, a flow is characterized by the parameters of a dual token bucket
algorithm. The proposed admission control algorithm evaluates an effective bandwidth
and then checks that the sum of the effective bandwidths is less than the AC Limit.
The goal is to provide a very low packet loss rate for in-profile packets. A complete
description of the Admission Control algorithms can be found in [5].
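
The two simplest checks above (peak rate allocation for TCL-1 and the sum of sustained rates for TCL-3) can be sketched as a declaration-based admission test. The class below is a minimal illustration, not the AQUILA algorithm: the name `EdgeRouterAC`, the rate units and the use of "less than or equal" at the boundary are assumptions, and the REM and effective-bandwidth computations used for TCL-2 and TCL-4 are deliberately left out.

```python
class EdgeRouterAC:
    """Declaration-based admission control at one Edge Router (sketch)."""

    def __init__(self, ac_limits):
        # ac_limits: dict mapping a traffic class name to its AC Limit.
        self.ac_limits = dict(ac_limits)
        self.reserved = {tcl: 0 for tcl in ac_limits}

    def request(self, tcl, declared_rate):
        """Admit the flow if the sum of declared rates stays within the limit.

        declared_rate is the peak rate for TCL-1 and the sustained bit rate
        for TCL-3 (assumption: both are given in the same unit as the limit).
        """
        if self.reserved[tcl] + declared_rate <= self.ac_limits[tcl]:
            self.reserved[tcl] += declared_rate
            return True
        return False

    def release(self, tcl, declared_rate):
        """Give back the rate of a terminated flow."""
        self.reserved[tcl] = max(0, self.reserved[tcl] - declared_rate)
```

Because every flow is charged its full declared rate, a check of this kind tends to be conservative whenever sources send below their declaration.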



5 QoS at Work: Prototype Implementation and Trial Experiences 

As mentioned earlier the project follows a two phased approach, each containing the 
full circle of prototypical software development with design, implementation, integra- 
tion and trial. The first phase ending with the 1st trial was meant to give early results 
on the experimental verification of the AQUILA concepts and the operability of the 
proposed architecture. Therefore the 1st trial was planned to run as a lab trial in three 
different operator sites, each focussing on different configuration aspects and network 
topologies, and following different trial scenarios. 

During the first year of the project the QoS IP network architecture was defined
and prototypes of the ACA, RCA and EAT were implemented. In addition, a Distributed
Measurement Architecture (DMA) was developed in AQUILA [6] for the generation of
foreground traffic, active network probes and the collection of QoS monitoring
information from the routers. It is designed to validate the end-to-end QoS provision
in AQUILA based on mappings between the measured end-to-end QoS, the network service
used and the end-to-end QoS that is required by the user. It also offers monitoring
of QoS information (e.g. packet loss) for the resource control layer.

This measurement system consists of two main parts, the measurement server and
the measurement clients. The measurement server is situated in the core of the
network, while the measurement clients are distributed at the network leaves. For
one-way delay measurements, the clients are equipped with GPS hardware for time
synchronization.

After successful integration of all components that were developed at various
partner sites all over Europe - the consortium consists of 12 partners from six
European countries - the 1st trial experiments were prepared and performed at three
partner sites: the main site is located in Warsaw, and the two others are in Vienna
and Helsinki [7]. The following figure (Fig. 5) shows the configuration of the Warsaw
testbed as an example.

These testbeds differ in network configurations, main focus of test scenarios and in 
the number of routers. Appropriate software developed inside the AQUILA project 
was integrated in these testbeds. In particular, EAT, RCA, ACA and measurement 
tools were integrated with router configurations. 

The trial experiments are mainly focused on the evaluation of previously defined 
network services providing QoS: Premium CBR (PCBR), Premium VBR (PVBR), 
Premium Multimedia (PMM) and Premium Mission Critical (PMC). For the purpose 
of the AQUILA demonstrator, testing of chosen Internet applications is also included 
in the plans. 

The following major objectives for the trials were defined: 

In Warsaw: two network services, PCBR and PMM. A mixture of network services is
also being tested in order to check the ability of the AQUILA architecture to
provide service differentiation. Additionally, the correctness of the admission
control and resource pool algorithms is verified. The experiments are performed
with artificially generated traffic and with real Internet applications.

In Vienna: two network services, PVBR and PMC. 

In Helsinki: measurements of the performance of the RCL. 

Several trials with the pre-defined network services were performed, either using a
single service or a combination of services. Details of the trial results can be found in
the First Trial Report [7].




Fig. 5. Configuration of the Warsaw testbed

The trial experiments, which are still ongoing, made it possible to verify the AQUILA
architecture concept. The introduction of different network services for serving
streaming and elastic traffic is justified. The matching of PCBR to streaming traffic,
of PVBR to video conferencing and of PMM to elastic traffic is verified. The need for
the second class for elastic traffic (PMC) is still under consideration; the project
is looking for real applications that can benefit from this class. The need for
admission control mechanisms for providing QoS is justified and the effectiveness of
the implemented admission control mechanisms was tested. In particular, it was found
that the utilization of the network was lower than expected, i.e. the admission
control algorithms are too conservative.

Another lesson learned from the implementation and trial experience is that it is
difficult to provide correct Traffic Parameters for the different applications.
Together with the previous observation about a possible inefficiency of the admission
control algorithms, this suggests improving the overall architecture with input from
measurements. This input can improve the admission control efficiency, by introducing
some concepts from the measurement-based approach, and can relieve the application
from the need to provide a complex set of Traffic Descriptor parameters.



6 Current Work 



For the second phase of the project the encouraging results of the 1st trial will be
fed back into the specification and design of the enhanced architecture elements.
This will include in particular the feedback from measurements in the Traffic
handling procedures. The second phase will also focus on the Inter-domain aspect and
on the running of the 2nd trial involving end-users. In this section we will briefly
deal with Control Loops and we will provide some comments on the Inter-domain
aspects.

The logical process for QoS provisioning as defined in section 4 can be seen as an
open loop from Provisioning to configuration of Resource Pools, Admission Control
and Traffic Control. Fig. 6 gives a pictorial representation of the inclusion of
control loops in the AQUILA architecture. The only point where feedback from the
actual network status is included in the 1st trial architecture is the dynamic
adaptation of Resource Pools according to reservation request messages (feedback “A”
in Fig. 6). The time scale for this control loop is in the order of minutes/hours.

Taking into account the input from “on-line” measurements, two additional control
loops can be envisaged. One is aimed at improving efficiency in network utilization
by enhancing the Admission Control functionality. The Declaration Based admission
control used in the 1st trial will be enhanced with concepts of Measurement Based
Admission Control (feedback “B” in Fig. 6). The time scale of operation of this
control loop is in the order of a few seconds.

The second additional control loop refers to the initial provisioning phase.
This phase is based on estimates of the traffic matrix. The measurement of actual
traffic can obviously be used to tune the provisioning (feedback “C” in Fig. 6). The
time scale of this control loop is in the order of hours/days.

Fig. 6. Control loops in the AQUILA architecture



As for the problem of Inter-domain QoS, it will be tackled by trying to define
efficient and scalable mechanisms for aggregating Inter-domain resource reservations
and for conveying the needed signaling information across the domains. The basic
approach is derived from the BGRP proposed in [14], which will be adapted to suit the
AQUILA needs.



7 Conclusions 

In this paper the results of the ongoing IST project AQUILA have been discussed.
The AQUILA project has defined and implemented a QoS architecture for IP net- 
works. It exploits DiffServ-like functionality and introduces a Resource Control 
Layer for requesting and managing QoS resources. A stepwise approach in the devel- 
opment has been undertaken. Lab trials related to the first phase have been carried 
out. The benefits of the proposed architecture have been investigated and consider- 
able experience has been gained. Work is now ongoing on the specification of the 
second phase. 







References 

[1] IST project AQUILA deliverable D002 “Project Presentation”, April 2000,
http://www-st.inf.tu-dresden.de/aquila (Publications)

[2] X. Xiao, L.M. Ni, “Internet QoS: A Big Picture”, IEEE Network, March 1999

[3] W. Zhao, D. Olshefski and H. Schulzrinne, “Internet Quality of Service: an Overview”,
Columbia University, New York, New York, Technical Report CUCS-003-00, Feb. 2000

[4] IST project AQUILA deliverable D1201 “System architecture and specification for first
trial”, June 2000, http://www-st.inf.tu-dresden.de/aquila (Publications)

[5] IST project AQUILA deliverable D1301 “Specification of traffic handling for the first
trial”, July 2000, http://www-st.inf.tu-dresden.de/aquila (Publications)

[6] IST project AQUILA deliverable D2301 “Report on the development of measurement
utilities for the first trial”, September 2000, http://www-st.inf.tu-dresden.de/aquila
(Publications)

[7] IST project AQUILA deliverable D3201 “First Trial Report”, June 2001,
http://www-st.inf.tu-dresden.de/aquila (Publications)

[8] B. Koch, “A QoS architecture with adaptive resource control - The AQUILA approach”,
Interworking'2000 (Fifth International Symposium on Interworking), Bergen, Norway,
October 3-6, 2000, available at http://www-st.inf.tu-dresden.de/aquila (Publications)

[9] Proposed Chapter for SLSU WG (version 1), 23/02/01,
http://www.ist-tequila.org/slsuwgv1.txt

[10] Y. T'Joens et al., “Service Level Specification and Usage Framework”,
draft-manyfolks-sls-framework-00.txt, October 2000

[11] S. Salsano et al., “Definition and usage of SLSs in the AQUILA consortium”,
draft-salsano-aquila-sls-00.txt, November 2000, http://www-st.inf.tu-dresden.de/aquila/

[12] B. Davie et al., “An Expedited Forwarding PHB”, Internet Draft,
draft-ietf-diffserv-rfc2598bis-01.txt, April 2001

[13] Final report COST 242, Broadband network teletraffic: Performance evaluation and
design of broadband multiservice networks (J. Roberts, U. Mocci, J. Virtamo eds.),
Lecture Notes in Computer Science 1155, Springer 1996

[14] P. Pan, E. Hahne, and H. Schulzrinne, “BGRP: A Tree-Based Aggregation Protocol for
Inter-domain Reservations”, Journal of Communications and Networks, Vol. 2, No. 2,
June 2000, pp. 157-167




An Adaptive Periodic FEC Scheme 
for Internet Video Applications 



Tae-Uk Choi, Myoung-Kyoung Ji, Seong-Ho Park, and Ki-dong Chung 



Department of Computer Science, Pusan National University, 
Kumjeoung-Ku, Pusan, South Korea 

{tuchoi, bluesky, shpark, kdchung}@melon.cs.pusan.ac.kr



Abstract. When transmitting packets compressed with a high-compression
standard such as MPEG or H.261, the loss of a single packet has a considerable
effect on the following frames because of motion estimation and compensation.
There are many techniques that prevent this error propagation. We classify the
video error control techniques into codec-level and network-level schemes and
investigate the effect of various combinations. As a result, we propose a new
FEC-based video error control scheme, Periodic FEC, which provides error
resilience, adaptability to network conditions and the ability to be combined with
other schemes. Experiments confirm that the Periodic FEC scheme is superior to
other schemes such as FEC and RPS, and that its performance can be maximized
when it is combined with other schemes.



1 Introduction 

Real-time video transmission over the Internet is very challenging. In the current 
Internet, because network loss and delay are variable and bandwidth is limited, the 
QoS of video applications is not guaranteed and video transmissions require high 
compression efficiency. Video compression standards such as MPEG and H.261 are 
not designed for transmission over a lossy channel. Although they can achieve very 
impressive compression efficiency, even a small amount of data loss can severely 
degrade video quality. Namely, because the codec uses motion estimation and com- 
pensation to remove temporal redundancy in a video stream, packet loss in a frame 
can be propagated to the subsequent frame and get amplified. 

Many video error control techniques have been proposed for the prevention of er- 
ror propagation [1][2][3]. The simplest approach is to use intra-coded frames at peri- 
odic intervals. But, this clearly has large bandwidth requirements. Another approach 
is to intra-code and transmit only those blocks in a frame that change more than some 
threshold. Such a process is referred to as conditional replenishment and is used in nv
[2] and vic [3].

We classify these video error control techniques into codec-level and network-level
schemes. A codec-level scheme is implemented within the encoder and the decoder,
whereas a network-level scheme can be implemented at the transmission level alone,
without the codec's help. ET (Error Tracking) [4] and RPS (Reference Picture
Selection) [5] are representative codec-level schemes. Retransmission and FEC
(Forward Error Correction) are the two major network-level schemes.

These schemes have their own strengths and weaknesses. To improve performance,
the schemes at one level can be combined with the techniques at another level.
For example, FEC and RPS form a combination that provides good performance. When
FEC cannot recover a lost frame due to long burst losses, RPS can prevent the loss
from propagating to the successive frames. However, this combination still has its
own weaknesses: the additional bandwidth overhead of FEC and the supplementary
buffer space required by RPS.

After analyzing various combinations, we propose a new FEC-based error control
scheme, called Periodic FEC, which provides error resilience with a smaller
bandwidth than that of FEC and stops error propagation without a feedback channel.
Moreover, the scheme can dynamically adjust the redundant information depending
on the network loss rate, and its performance can be maximized when it is combined
with other feedback-based error control mechanisms.

Through experiments, we show the effectiveness, adaptability and combinability
of Periodic FEC. As a result, the Periodic FEC scheme is superior to other schemes
such as FEC and RPS, and its performance can be maximized when it is combined
with RPS.

The rest of this paper is organized as follows. Section 2 presents an overview of
related work. Section 3 describes the Periodic FEC scheme and its combination
schemes. Section 4 presents the effectiveness of Periodic FEC and its combination
schemes through experiments. Section 5 presents the conclusion.



2 Related Works 

Error control schemes can be classified into three types. The first type consists of
codec-level error control schemes that are implemented at the coding level. The
second type consists of network-level error control schemes that are implemented at
the transmission level. The third type consists of techniques that combine codec-level
and network-level techniques. The following describes works related to each level.



2.1 Codec-Level Schemes 

Recently, H.263+ incorporated two feedback-based error control techniques: error
tracking (ET) and reference picture selection (RPS) [6]. ET requires the encoder to
know the location and extent of erroneous image regions in displayed images. This
scheme requires feedback messages from the receiver. The receiver sends information
about missing packets, and the encoder estimates the region of error propagation in
the displayed images and intra-codes the blocks contained in that region. This scheme
is attractive since it does not require any modifications of the bit stream syntax of
the motion-compensated coder.




Fig. 1. Illustration of a typical reference picture selection

RPS allows the encoder to select one of several previously decoded frames as a 
reference picture for motion estimation. It is designed to support a coding technique 
called NEWPRED [5][7]. Fig. 1 illustrates the basic operation of a typical reference
picture selection. When the receiver does not correctly decode a picture, it notifies the 
transmitter. Based on this notification, the transmitter determines the picture it will 
select as a reference. As shown in the figure, an error in picture 2 is propagated up to 
picture 3 because each picture refers to the previous picture or the last decoded pic- 
ture. However, picture 4 refers to picture 1, which is correctly decoded. Thus, the 
receiving side eliminates error propagation without additional bandwidth overhead. 
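
The reference choice in this example can be written down as a small Python sketch. It is an illustration only: the function name `select_reference`, the `refs` and `nacked` data structures and the rule "take the newest frame whose prediction chain contains no reported loss" are assumptions made here, not the NEWPRED or H.263+ signalling itself.

```python
def select_reference(current, refs, nacked):
    """Pick a reference picture for frame `current` (frames are numbered from 1).

    refs   -- dict: frame number -> frame it was predicted from (None = intra)
    nacked -- set of frame numbers the receiver has reported as damaged
    Returns the newest earlier frame whose whole prediction chain is free of
    reported losses, or None if no such frame exists (encode as intra).
    """
    def damaged(frame):
        # Walk back along the prediction chain looking for a reported loss.
        while frame is not None:
            if frame in nacked:
                return True
            frame = refs[frame]
        return False

    for candidate in range(current - 1, 0, -1):
        if not damaged(candidate):
            return candidate
    return None
```

With `refs = {1: None, 2: 1, 3: 2}`, frame 3 is still predicted from frame 2 as long as no feedback has arrived; once the loss of frame 2 is reported, frame 4 falls back to frame 1, as in the example above.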

ET and NEWPRED have a limitation in that they should modify their picture cod- 
ing patterns based on specific information about lost packets. Thus, continuous feed- 
back from a receiver is essential for their performance. 



2.2 Network-Level Schemes 

Retransmission and FEC are the two major error control schemes at the network
level. The former can provide good error resilience without incurring much bandwidth
overhead because packets are retransmitted only when there are indications of
packet loss. Because retransmission always involves additional transmission delay, it
is generally considered ineffective for interactive real-time video applications.
However, it can still be a very effective technique for improving error resilience in
such applications: in [8], a retransmission-based error control technique is proposed
that effectively alleviates error propagation in the transmission of interactive video.

In the latter case, redundant information is transmitted along with the original
information so that lost original data can be recovered from the redundant
information. This scheme is attractive because it provides resilience to loss without
increasing latency. Many FEC-based schemes involve exclusive-OR operations [9].
These increase the send rate of the source by a factor of 1/k, and they add latency
since k packets must be received before the lost packet can be reconstructed. Bolot
and Turletti [1] proposed an interesting FEC scheme for packet video where a packet
contains the redundant information of some previous packets. The redundant
information is created by encoding the image blocks contained in the previous packets
with a large quantization step. They claimed that if the video source is not bursty
and long burst losses are rare, then their scheme would work well for video.
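
The exclusive-OR schemes mentioned above can be illustrated with a short Python sketch: one parity packet is sent for every group of k data packets, and any single lost packet of the group is rebuilt by XOR-ing the parity with the packets that did arrive. The function names are hypothetical, and the sketch assumes that all packets of a group have the same length (for example after padding).

```python
def xor_parity(packets):
    """Return the byte-wise XOR of a list of equally sized packets."""
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, byte in enumerate(pkt):
            parity[i] ^= byte
    return bytes(parity)


def recover(received, parity):
    """Rebuild the one missing packet of a group.

    received -- list of k entries, with None in place of the lost packet
    parity   -- the XOR parity packet computed by the sender
    Returns (index, data) of the reconstructed packet.
    """
    missing = [i for i, pkt in enumerate(received) if pkt is None]
    if len(missing) != 1:
        raise ValueError("one XOR parity packet repairs exactly one loss per group")
    present = [pkt for pkt in received if pkt is not None]
    return missing[0], xor_parity(present + [parity])
```

One parity packet per k data packets is where the 1/k increase in send rate comes from, and the receiver has to wait for the rest of the group before it can repair a loss.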



2.3 Combined Schemes 

There are some schemes that combine codec-level and network-level schemes. In 
[10], packets arriving after their display times are not discarded but instead used to 
reduce error propagation. Although a periodic frame may be displayed with errors 
because of some loss of its packets, the errors will stop propagating beyond the next 
periodic frame because these losses can be recovered within a PTDD (Periodic Tem- 
poral Dependency Distance). When this scheme is combined with retransmission or 
FEC, the performance is effectively improved. However, under high motion scenes, it 
would be less effective, and it would need extra computation and buffer space at the 
receiver side. 



3 Combined Error Control Schemes 

3.1 Simple Combined Schemes 

Simple combination schemes are considered in order to discover the effects of various
combinations. When the RPS or ET scheme of the codec level is combined with the
retransmission scheme of the network level, the two are not complementary to each
other because both rely on a feedback-based mechanism. Such a combination may incur
a long delay to retransmit the lost packet and to wait for the feedback message for
RPS or ET. FEC, however, can be combined with RPS or ET to improve error resilience
because the two are complementary to each other. If FEC fails to recover the lost
packet at the transmission level, RPS or ET will stop error propagation at the
coding level. In the combination of FEC and ET, the bandwidth overhead may be
large because of the redundant information of FEC and the intra-coded blocks of ET.
The combination of FEC and RPS is a good one because when a lost frame is
unrecoverable by FEC, RPS can eliminate error propagation with a low overhead.

To study the effectiveness of various combinations, we investigate a simple combination
scheme. Fig. 2 shows the combination of FEC and RPS, in which RPS is used at the codec
level and an RSE-based FEC scheme is used at the network level. A Reed-Solomon erasure
correcting code (RSE code) [11] is a commonly used FEC encoder in which k source packets
of P bits are encoded into n (> k) packets of P bits (namely, k data packets plus n-k
parity packets). This group of n packets is called an FEC block. The RSE decoder on the
receiver side can reconstruct the source data packets using any k packets out of its FEC
block. As shown in Fig. 2, the RSE decoder fails to recover frame 3 because three packets
are lost in the network. Thus, the receiver sends a NACK message to the sender. Based on
this feedback message, the encoder on the sender side modifies the reference of frame 4
to depend on frame 2. Thus, the error propagation due to the loss of frame 3 is stopped.
Consequently, while FEC helps reduce the occurrences of independent losses at the network
level, RPS stops error propagation when losses are unrecoverable at the network level.
However, since FEC needs additional bandwidth and RPS requires additional buffer space,
this combination still has some weaknesses.
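
The interplay of the two levels can be sketched as follows. The sketch models only the
decision logic, not the Reed-Solomon arithmetic, and it ignores feedback delay. The
function names and the per-frame loss counts are illustrative assumptions, except that
frame 3 loses three packets as in Fig. 2.

```python
def block_recoverable(received: int, k: int) -> bool:
    """An (n, k) erasure code rebuilds the k source packets from any k received packets."""
    return received >= k


def choose_reference(frame: int, unrecoverable: set) -> int:
    """RPS: predict from the most recent earlier frame that was not reported lost.

    Returns 0 when no such frame exists (assumed here to mean "intra-code the frame").
    """
    ref = frame - 1
    while ref > 0 and ref in unrecoverable:
        ref -= 1
    return ref


N, K = 5, 3                                   # the (5, 3) RSE code of Fig. 2
lost_per_frame = {1: 0, 2: 1, 3: 3, 4: 0}     # packets lost in each frame's FEC block
nacked = {f for f, lost in lost_per_frame.items()
          if not block_recoverable(N - lost, K)}
reference_for_frame_4 = choose_reference(4, nacked)
```

With these numbers only frame 3 is reported by NACK, so frame 4 is predicted from frame 2.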




Fig. 2. Combination of RPS and FEC with a (5, 3) RSE code 



3.2 Periodic FEC 

Through the analysis of the simple combined schemes, we found that a new scheme 
should have the following features: Firstly, error propagation should be prevented and 
the redundant information should be minimized. Secondly, when it is combined with 
other schemes, the strengths should be maximized while weaknesses should be mini- 
mized. Lastly, it should adapt to the network condition to control the overhead. 

Considering these features, we design a new FEC-based error control scheme: Periodic
FEC. This scheme consists of the frame reference method in the encoder and the
transmission method in the network. As shown in Fig. 3, every p-th frame is referred to
as a periodic frame, and frames between two consecutive periodic frames are referred to
as nonperiodic frames. Fig. 3 (a) shows that a periodic frame depends on its previous
periodic frame for motion estimation, while every nonperiodic frame depends only on its
immediately preceding periodic frame. Thus, if periodic frames are received safely,
errors in nonperiodic frames do not propagate. For the safe transmission of periodic
frames, the conventional FEC scheme is used. The redundant information of only periodic
frames is encoded and then transmitted along with the original data. Fig. 3 (b) shows how
the redundant data is transmitted along with the original data. The redundant data sent
with periodic frame n is created from its previous periodic frames n-i. We refer to the
maximum value of i as the order of Periodic FEC. As shown in Fig. 3, when the order value
is 1, if the periodic frame denoted as n is lost, then the receiver waits for periodic
frame n+1, decodes the redundant information, and displays the reconstructed
information. Consequently, this method stops error propagation without any feedback
information.

Bolot et al. proposed an adaptive FEC scheme in which the amount of redundant
information can be adjusted to the packet loss rate in the network by increasing or
decreasing the order value [1]. As the order value becomes large, both the amount of
redundant information and the probability of packet recovery increase. Like Bolot et
al.’s scheme, Periodic FEC could adjust the amount of redundant information over time
depending on the measured loss rate of periodic frames in the network. However, the
packet loss process would have to be observed over long periods because the interval
between periodic frames is long and the loss probability of a periodic frame is reduced
by FEC. Thus, this mechanism cannot react quickly to the network conditions. We propose a
NACK-based redundancy control algorithm that can react quickly to network conditions and
can easily be combined with other feedback-based error control algorithms.
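
The frame reference rule of Periodic FEC described above can be sketched as follows. The
sketch assumes frames are numbered from 0 and that frames 0, p, 2p, ... are the periodic
ones; both the numbering and the function names are illustrative assumptions.

```python
def reference_frame(i: int, p: int):
    """Reference of frame i when every p-th frame (0, p, 2p, ...) is periodic."""
    if i == 0:
        return None                 # first frame: assumed to be intra-coded
    if i % p == 0:
        return i - p                # periodic frame -> previous periodic frame
    return (i // p) * p             # nonperiodic frame -> preceding periodic frame


def affected_frames(lost: int, p: int, total: int) -> list:
    """Frames that cannot be decoded correctly if frame `lost` is never recovered."""
    bad = {lost}
    for i in range(lost + 1, total):
        if reference_frame(i, p) in bad:
            bad.add(i)
    return sorted(bad)


only_itself = affected_frames(lost=5, p=4, total=12)    # a nonperiodic frame is lost
whole_tail = affected_frames(lost=4, p=4, total=12)     # a periodic frame is lost
```

Losing a nonperiodic frame damages only that frame, whereas an unrecovered periodic frame
damages every later frame. This is why only periodic frames are protected with redundant
data.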




Fig. 3. Frame reference and redundant information of Periodic FEC. (a) Frame reference of
PFEC, with the period distance marked. (b) Redundant information: parity code for frame n-1.



Fig. 4 shows the proposed algorithm, which is based on the length of the burst of NACKs
for periodic frames. Initially, the order value is set to 0; no redundant data is sent.
When a NACK message is received, the order value is set to 1. When two NACK messages are
received successively, the length of the burst NACK is 2, and the order value is set to
2. On the other hand, when no NACK message is received for a constant time
MAX_WAITING_TIME, the order value is reset to 0. Consequently, this algorithm can react
quickly to NACK messages, reduce the additional redundancy of the Periodic FEC scheme,
and improve the video quality by dynamically adjusting the order value.

Another approach to adjusting the amount of redundant information is to increase or
decrease the distance of a period. As the distance increases, the amount of redundant
data decreases because only one periodic frame is transmitted along with the redundant
data in a period. When a NACK message is received, this mechanism reduces the distance
of a period, and when no NACK message is observed for a constant time, it increases the
distance. However, this approach cannot recover successive packet losses because the
order value is fixed. In the worst case, if the period distance is 1, Periodic FEC
functions like conventional FEC.

IF a NACK message is received THEN
BEGIN
    check the length of the burst NACK, L_Burst_NACK;
    IF L_Burst_NACK <> Order THEN Order = L_Burst_NACK;
END
ELSE
BEGIN
    check the elapsed time T_elapse since the last NACK was received;
    IF T_elapse >= MAX_WAITING_TIME THEN Order = 0;
END

Fig. 4. A NACK-based redundancy control algorithm 
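
A runnable sketch of the control loop of Fig. 4 is given below. It assumes that the
controller is invoked once per periodic frame with the feedback observed for that frame
and that the burst length counts consecutive NACKs. The class name `OrderController` and
the time unit are hypothetical.

```python
class OrderController:
    """Tracks the Periodic FEC order from NACK feedback (sketch of Fig. 4)."""

    def __init__(self, max_waiting_time: float):
        self.max_waiting_time = max_waiting_time
        self.order = 0              # initially no redundant data is sent
        self.burst_len = 0          # length of the current burst of NACKs
        self.last_nack = None       # arrival time of the most recent NACK

    def on_periodic_frame(self, now: float, nack: bool) -> int:
        """Update and return the order for the feedback of one periodic frame."""
        if nack:
            self.burst_len += 1
            self.last_nack = now
            if self.burst_len != self.order:
                self.order = self.burst_len
        else:
            self.burst_len = 0      # the burst of NACKs is interrupted
            if (self.last_nack is not None
                    and now - self.last_nack >= self.max_waiting_time):
                self.order = 0      # quiet for long enough: stop sending redundancy
        return self.order


ctrl = OrderController(max_waiting_time=3.0)
feedback = [False, True, True, False, False, False, False]
orders = [ctrl.on_periodic_frame(float(t), nack) for t, nack in enumerate(feedback)]
```

In this trace the order rises to 2 after two successive NACKs, stays there while the
waiting time has not expired, and returns to 0 afterwards.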



3.3 Complementary Schemes 

In Periodic FEC, a periodic frame may not be recovered from the redundant information if
periodic frames are lost consecutively in the network. Then, none of the frames in the
period can be decoded, and the error may propagate to the next periodic frame. To prevent
this error propagation, Periodic FEC can be combined with a complementary scheme such as
retransmission, intra-frame replenishment, or reference picture selection.

One approach is the combination with the intra-frame replenishment scheme. In RESCU [6],
periodic frames cannot be recovered before the decoding of their dependent frames. This
leads to error propagation, and thus intra-frame replenishment was used to prevent it.
The receiver notifies the sender about the irrecoverable losses, and the notification
triggers the sender to code the next frame as an intra-frame. Like RESCU, Periodic FEC
can be combined with intra-frame replenishment.

Another approach is retransmission. When the receiver notifies the sender about the 
loss of a periodic frame, the sender retransmits only the redundant information of the 
lost periodic frame along with the next periodic frame. 

The third approach is to use the RPS method in the codec. When the receiver cannot 
recover a periodic frame, it sends a NACK message to the sender. Using this feed- 
back message, the encoder uses for motion prediction the periodic frame that is not 
reported missing. Thus, error propagation is eliminated without increasing bandwidth. 
Moreover, this approach can be nicely matched with the NACK-based redundancy 
control algorithm of Periodic FEC. 







4 Experiments 

The objective of our experimental work is to study the potential effectiveness of
Periodic FEC and its combination schemes. To achieve this objective, we first show that
Periodic FEC can adapt to various network conditions. We then compare the performance of
the Periodic FEC scheme with that of other existing recovery schemes in terms of SNR and
bit overhead.

We modified the Telenor H.263 codec source to implement the Periodic FEC scheme and the
other schemes, and conducted actual video transmission tests over the Internet from Pusan
to Seoul, South Korea. The transmission tests were conducted between 2 p.m. and 6 p.m.
for 3 days. The average loss rate was about 10%, and the average delay was about 100 ms.
During the tests, the byte rate transmitted from the sender and the SNR at the decoder
were measured to compare the performance of the schemes. The byte rate indicates
bandwidth overhead, and SNR indicates video quality.

For convenience, we refer to the Periodic FEC scheme as PFEC, and the combination scheme
of FEC and RPS is referred to as FEC+RPS. Also, PFEC with a fixed order value is called
static PFEC, and PFEC whose order value changes depending on network loss is called
dynamic PFEC, abbreviated as DPFEC.



4.1 A Comparison between Static and Dynamic PFEC 

To show the ability to control redundant information according to various packet losses,
we compared dynamic PFEC to static PFEC with an order value of 1. As shown in the first
half of the graphs in Fig. 5, the dynamic PFEC sends no redundant data because no packet
loss occurs, while its SNR value is similar to that of the static PFEC. In the latter
half of the graphs, the order value is increased up to 2 due to consecutive losses of
periodic frames. Thus, the dynamic PFEC transmits more redundant data, and its SNR value
rises above that of the static PFEC. Consequently, the dynamic PFEC is superior to the
static PFEC because the static PFEC encodes and transmits a fixed amount of redundant
data for all periodic frames, while the dynamic PFEC encodes redundant data only when
frames have been lost. Table 1 shows the average performance of the two schemes. The
dynamic PFEC reduces the byte rate by almost 20% (from 1348.2 to 1111.8) while its SNR is
similar to that of the static PFEC.



Table 1. Average SNR and byte rates of the static and the dynamic PFEC 

                 Avg. SNR    Avg. Byte rate
Static PFEC      18.2        1348.2
Dynamic PFEC     19.1        1111.8



4.2 A Comparison between the Dynamic PFEC and Other Schemes 

The dynamic PFEC is compared with conventional schemes such as FEC and RPS to show its
superior performance, and with combination schemes such as FEC+RPS and DPFEC+RPS to show
the effectiveness of combination. It is assumed that the order of the FEC scheme is 1.
Table 2 shows the average SNR and the byte rate of each scheme.



Table 2. The average SNR and the byte rate of five types of schemes 

                 Avg. SNR    Avg. Byte rate
FEC              18.8        1607.5
RPS              16.8        1047.4
FEC+RPS          20.5        1640.8
Dynamic PFEC     18.7        1111.8
DPFEC+RPS        19.2        1136.9



Fig. 6 shows the SNR and the byte rate of the FEC and the dynamic PFEC. The FEC incurs
additional bandwidth overhead because it encodes the redundant data of all the frames.
Also, its SNR value is not high because it has no mechanism for preventing error
propagation. The dynamic PFEC, however, can stop error propagation and minimize the
amount of redundant data. As shown in Table 2, the FEC transmits about 1.4 times as much
data as the dynamic PFEC does, while their SNR values are similar.

Fig. 7 shows a comparison of the dynamic PFEC to RPS. RPS depends on previously decoded
frames for motion estimation. Under consecutive packet losses, the interval between the
frame being encoded and its reference frame increases and the temporal redundancy
between them is reduced, resulting in a higher byte rate. As shown in the latter part of
the graphs in Fig. 7, where the packet losses are heavy, the byte rate of RPS increases,
and its SNR value becomes lower than that of the dynamic PFEC. Table 2 shows that the
dynamic PFEC can improve video quality (SNR) with a small additional amount of redundant
information.

Fig. 8 shows a comparison of the dynamic PFEC to FEC+RPS. FEC+RPS provides good
performance in terms of the SNR value and poor performance in terms of the byte rate.
This is a typical example of the tradeoff between redundancy and video quality: the
greater the amount of redundancy, the higher the quality. However, as shown in Table 2,
FEC+RPS transmits more than 1.4 times as much data, while the improvement in SNR is less
than 10%.
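
The ratios quoted in this section can be checked directly against Table 2. The short
calculation below uses only the table values; the helper name `relative_to` is
hypothetical, and the ratios are computed on the total byte rates reported in the table.

```python
table2 = {                      # scheme: (avg. SNR, avg. byte rate), copied from Table 2
    "FEC": (18.8, 1607.5),
    "RPS": (16.8, 1047.4),
    "FEC+RPS": (20.5, 1640.8),
    "Dynamic PFEC": (18.7, 1111.8),
    "DPFEC+RPS": (19.2, 1136.9),
}


def relative_to(baseline: str, scheme: str):
    """Return (byte-rate ratio, SNR gain in percent) of `scheme` over `baseline`."""
    snr_b, rate_b = table2[baseline]
    snr_s, rate_s = table2[scheme]
    return rate_s / rate_b, 100.0 * (snr_s - snr_b) / snr_b


for name in table2:
    ratio, gain = relative_to("Dynamic PFEC", name)
    print(f"{name:13s} byte rate x{ratio:.2f}, SNR {gain:+.1f}%")
```

Relative to the dynamic PFEC, the byte rate of FEC+RPS is roughly 1.5 times higher for an
SNR gain a little below 10%, whereas DPFEC+RPS costs only about 2% more bytes.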

Fig. 9 shows a comparison of the dynamic PFEC to the combination scheme, DPFEC+RPS. As
shown in the figure, the SNR and the byte rate of DPFEC+RPS are similar to those of the
dynamic PFEC. However, when packet losses are heavy, the dynamic PFEC alone cannot
recover periodic frames and prevent error propagation. Thus, under high packet loss, the
SNR of DPFEC+RPS is higher than that of the dynamic PFEC.

Fig. 5. Comparison between the dynamic PFEC and the static PFEC 

Fig. 6. Comparison between the dynamic PFEC and FEC 

Fig. 7. Comparison between the dynamic PFEC and RPS 



In summary, the dynamic PFEC is superior to other schemes because it provides a 
good SNR at the cost of only a small amount of redundancy. Moreover, when it is 
combined with RPS, this scheme provides the maximum performance by preventing 
error propagation perfectly. 

Fig. 8. Comparison between the dynamic PFEC and FEC+RPS 

Fig. 9. Comparison between the dynamic PFEC and DPFEC+RPS 



5 Conclusion 

This paper classifies the video error control techniques into codec-level and
network-level schemes, and investigates the effects of various combinations of these
schemes. After considering the strengths and weaknesses of combination schemes, we
propose a new FEC-based video error control scheme, the Periodic FEC, which prevents
error propagation and provides a loss recovery mechanism with small bandwidth overhead.
Also, the scheme can adjust the amount of redundant data depending on network conditions.
Through experiments, we show the effectiveness and adaptability of the dynamic PFEC, and
that the performance of this scheme is maximized when it is combined with RPS. Future
studies will investigate the effectiveness of combinations of other schemes not discussed
in this paper.



Abstract. The construction of photo-realistic virtual worlds is within reach of current
computer graphics. Unfortunately, the philosophy currently adopted for the diffusion of
virtual worlds over the Internet has a fundamental drawback: it calls for downloading the
3D virtual world description to the client side. Recently, the split-browser approach was
proposed in order to solve this issue by transforming the problem of interacting with a
virtual world into an image transmission task. This work addresses the crucial issue of
compressing the image stream generated in the split-browser approach. The proposed
compression scheme first decomposes the virtual world into a union of objects, then
approximates each object as a polyhedron and, finally, compresses the image of each face
of the polyhedron by predicting it with a projective transformation.



1 Introduction 

Three-dimensional virtual environments are often described in VRML, a language tailored
for the Internet exchange of 3D virtual worlds. Unfortunately, the current way of
accessing VRML-described virtual worlds (i.e. downloading the VRML file to the user’s
computer, then rendering it) is plagued by several problems such as long downloading
times (a good quality VRML file can be as large as 100 MB), poor interaction (due to the
huge complexity of VRML files, which would require very powerful computers in order to
have a smooth navigation) and loss of copyright control (producing a good quality VRML
file can require several months of work, but if the user can download it, the author
loses the copyright control).

In order to solve such problems, the alternative split-browser approach for virtual
world browsing was proposed. The split-browser approach is based on the idea of
performing the rendering at the server’s side and sending the resulting images to the
client, transforming the interaction with the VRML file into a problem of image
transmission.

It is clear that a major problem that must be solved in order to make the split-browser
approach effective is the compression of the generated images. Although such a problem
could be solved by means of off-the-shelf solutions such as JPEG or MPEG, the images
transmitted with the split-browser approach have the peculiarity of being artificially
generated by the server, and this suggests that a specifically tailored compression
technique could result in a greater efficiency.

This paper presents a compression scheme that we are currently developing in order to
allow efficient access to virtual worlds over the Internet. In the proposed compression
scheme the sequence of rendered views is compressed by splitting it into smaller images,
which in turn are compressed by means of an MPEG-like approach, but using “generalized
motion vectors” representing projective transformations.

This paper has five sections. Section 2 briefly recalls the split-browser approach,
Section 3 describes the compression scheme that we are developing, Section 4 briefly
describes other issues related to the efficient access to remote 3D virtual
environments, and Section 5 gives the conclusions.



2 A Split-Browser Structure 

The recent development of 3D imaging technologies, e.g. the availability of range cam- 
eras and semiautomatic 3D modeling tools, makes feasible the construction of 3D models 
of real objects and suggests their use for interactive applications such as virtual visits. 

The only currently available way to view a 3D model is by locally downloading a file
(e.g. in VRML format) and having the client render it. This approach, however, has
several drawbacks:

- Descriptions of 3D models can be quite large. As an example, the full, uncompressed 
model of “Madonna con Bambino” by Giovanni Pisano (Fig. 1) is approximately 
100 Mbytes. 

- Rendering with good quality a VRML file requires a lot of memory and computa- 
tional power, often not available to home users. Interacting with virtual objects and 
environments can be quite frustrating if the rendering is too slow. 

- 3D model construction is a labor-intensive task and it is reasonable to assume that one 
may want to keep the copyright of the 3D model. If the electronic description of the 
model can be downloaded via Internet, copyright control becomes very challenging. 

In order to solve these problems, the split-browser approach was proposed. In order to
make this paper self-contained, we briefly recall the main ideas behind the
split-browser concept. As a first step, let us analyze how a VRML browser works. In
Fig. 2(a) one can see the internal structure of a generic VRML browser: the end user
interacts with a Graphical User Interface (GUI) which, by means of events, controls a
virtual world and a graphical engine (GE) whose task is to produce 2D views of the
virtual world.

The proposed solution splits the browser in two: the part with the GE is moved to the
server and the part with the GUI remains at the client (see Fig. 2(b)). The internal
events generated by the GUI in Fig. 2(a) are now transmitted to the server over the
network. At the server’s side the GE responds by sending back to the client the updated
views as images. This approach, which essentially turns the interactive inspection of a
3D model into an image transmission task, has the following advantages:

- The amount of data transmitted from the server to the client is much smaller than 
sending the full 3D model. 

- Visualization at the client side does not require strong computational capabilities 
anymore. 

- The copyright control of the 3D data is preserved, since the end user does not receive 
the whole model, but only 2D views of it. 

Fig. 1. 3D View of the complete model of “Madonna con Bambino” by Giovanni Pisano (Arena 
Chapel, Padova). 

Observe also that the split-browser idea allows several users to interact with the same
virtual environment, making possible applications like networked video games and the
sharing of scientific data.

Although the proposed solution is conceptually very simple, several questions must be
answered in order to make it suitable for applications. A fundamental issue is the choice
of a compression scheme that makes the downloading of the updated views more efficient.
Minor (but important) issues concern the transport protocol to be used for sending the
compressed images and how to avoid problems due to packet loss when sending the packets
over IP. In the following sections we address these questions.

3 A Compression Scheme 

Although off-the-shelf solutions (e.g., JPEG or MPEG) could be used in order to com- 
press the rendered view, it is well worth developing a compression scheme tailored to 
this particular application in order to exploit the fact that the images sent to the client 
are artificially generated. 

3.1 Images of Planar and Quasi-planar Objects 

Consider the situation depicted in Fig. 3, where the virtual world contains just a planar
object $O(\mathbf{x})$ (a picture, for example), $V_1$ is the position of the user’s
“virtual camera” and $L_1(\mathbf{x}) : \mathbb{R}^2 \to \mathbb{R}$ is the image shown on
the user’s screen. Suppose now the user moves to $V_2$ and let $L_2(\mathbf{x})$ be the
corresponding view. Our goal is to compress $L_2(\mathbf{x})$ by possibly exploiting the
fact that $L_1$ and $L_2$ are different views of the same object.

Fig. 2. The Split-browser approach: (a) The internal structure of a generic VRML browser.
(b) In the split-browser approach the graphical engine remotely runs at the server and
sends the computed views to the client.

It is well-known that if $O$ is planar, there is a simple relationship between $L_1$ and
$L_2$; more precisely,

$$L_2(\mathbf{x}) = L_1\!\left(\frac{A\mathbf{x} + \mathbf{b}}{\mathbf{c}^T\mathbf{x} + d}\right) \qquad (1)$$

where $A$ is a 2 x 2 real matrix, $\mathbf{b}, \mathbf{c} \in \mathbb{R}^2$ and
$d \in \mathbb{R}$. Transformations of type (1) are called projective transformations.
Parameters $A$, $\mathbf{b}$, $\mathbf{c}$ and $d$ can easily be computed by knowing the
positions of the planar object $O$ and the user’s virtual camera.

Fig. 3. Two views $L_1$ and $L_2$ of the same planar object are related by a projective
transformation.
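
As a small illustration of transformation (1), the sketch below maps a single 2D point
through a projective transformation. The function name `project` and the three example
parameter sets are assumptions chosen for illustration; they do not come from the paper.

```python
def project(x, A, b, c, d):
    """Map a 2D point x through x -> (A x + b) / (c^T x + d)."""
    num0 = A[0][0] * x[0] + A[0][1] * x[1] + b[0]
    num1 = A[1][0] * x[0] + A[1][1] * x[1] + b[1]
    den = c[0] * x[0] + c[1] * x[1] + d
    if den == 0:
        raise ZeroDivisionError("the point is mapped to infinity")
    return (num0 / den, num1 / den)


# Each tuple is (A, b, c, d); all values are illustrative.
identity = ([[1.0, 0.0], [0.0, 1.0]], (0.0, 0.0), (0.0, 0.0), 1.0)
shift = ([[1.0, 0.0], [0.0, 1.0]], (2.0, -1.0), (0.0, 0.0), 1.0)
perspective = ([[1.0, 0.0], [0.0, 1.0]], (0.0, 0.0), (0.5, 0.0), 1.0)

p_identity = project((3.0, 5.0), *identity)
p_shift = project((3.0, 5.0), *shift)
p_perspective = project((2.0, 4.0), *perspective)
```

With c = 0 and d = 1 the map reduces to an affine transformation, which covers the pure
translations used as motion vectors in MPEG; a nonzero c introduces the perspective
division.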

If the object is not planar, but it can be fairly well approximated by a planar object
(i.e., it is a quasi-planar object*), it is reasonable to expect that (1) “almost” holds,
in the sense that the power of the error image

$$E(\mathbf{x}) = L_2(\mathbf{x}) - L_1\!\left(\frac{A\mathbf{x} + \mathbf{b}}{\mathbf{c}^T\mathbf{x} + d}\right) \qquad (2)$$

is small. This suggests transmitting $L_2$ by sending the four parameters $A$,
$\mathbf{b}$, $\mathbf{c}$ and $d$ and the error image $E$ compressed by means of a
lossless or lossy technique. This approach resembles the approach used in MPEG, but with
the motion vectors replaced by the “generalized motion vector” represented by $A$,
$\mathbf{b}$, $\mathbf{c}$ and $d$.

3.2 Images of General Objects 

It is clear that the hypothesis of a planar (or quasi-planar) object is a very strong
one, since most objects cannot be considered quasi-planar. In such cases, the object is
first roughly approximated by means of a polyhedron whose faces correspond to
quasi-planar zones of the original object. Then, the region relative to each face of the
polyhedron is compressed as a quasi-planar object. The rough approximation of the object
can easily be obtained by simplifying the original model with well-known algorithms.

* Note that the condition of being quasi-planar depends both on the object and on the
viewing distance.

Fig. 4. (a) Close-up of the 3D model of Madonna con Bambino. (b) Example of a rough
approximation of (a) by means of planar faces.



More precisely, the compression algorithm we suggest is the following:

1. Determine the set $\mathcal{F}$ of the faces visible in both the old and the new view.

2. For each face $F \in \mathcal{F}$ found at the previous step:

(a) Determine the projection $P_F$ of $F$ on the old image.

(b) Determine the parameters $A_F$, $\mathbf{b}_F$, $\mathbf{c}_F$, $d_F$ of the
corresponding projective transformation.

(c) Compute the distorted version

$$M_F(\mathbf{x}) = L_1\!\left(\frac{A_F\mathbf{x} + \mathbf{b}_F}{\mathbf{c}_F^T\mathbf{x} + d_F}\right) \chi_{P_F}\!\left(\frac{A_F\mathbf{x} + \mathbf{b}_F}{\mathbf{c}_F^T\mathbf{x} + d_F}\right) \qquad (3)$$

where $\chi_{P_F}(\mathbf{x}) = 1$ if $\mathbf{x} \in P_F$ and
$\chi_{P_F}(\mathbf{x}) = 0$ otherwise.

3. Compute the predicted image

$$\hat{L}_2(\mathbf{x}) = \sum_{F \in \mathcal{F}} M_F(\mathbf{x}) \qquad (4)$$

4. Compute the prediction error

$$E(\mathbf{x}) = L_2(\mathbf{x}) - \hat{L}_2(\mathbf{x}) \qquad (5)$$

5. For each face $F \in \mathcal{F}$ transmit

- The parameters $A_F$, $\mathbf{b}_F$, $\mathbf{c}_F$, $d_F$

- The vertices of $P_F$

6. Transmit the prediction error $E(\mathbf{x})$, suitably compressed.
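
Steps 2 to 4 of the algorithm above can be sketched on a toy example. The sketch makes
several simplifying assumptions that are not in the paper: images are small lists of rows,
the support of each projected face is given as a set of pixel coordinates rather than as a
polygon, and the old image is sampled with nearest-neighbour rounding. All names are
hypothetical.

```python
def warp(x, y, A, b, c, d):
    """Apply the projective transformation of one face to pixel (x, y)."""
    den = c[0] * x + c[1] * y + d
    return ((A[0][0] * x + A[0][1] * y + b[0]) / den,
            (A[1][0] * x + A[1][1] * y + b[1]) / den)


def predict(old, faces, width, height):
    """Predicted image: sum over the faces of the masked, warped old image."""
    pred = [[0] * width for _ in range(height)]
    for params, support in faces:       # support: pixels of the face in the old image
        for y in range(height):
            for x in range(width):
                u, v = warp(x, y, *params)
                src = (round(u), round(v))      # nearest-neighbour sampling (assumption)
                if src in support:              # indicator function of the face
                    pred[y][x] += old[src[1]][src[0]]
    return pred


def prediction_error(new, pred):
    """Pixel-wise difference between the new view and its prediction."""
    return [[n - p for n, p in zip(row_n, row_p)] for row_n, row_p in zip(new, pred)]


old = [[9, 8, 0, 0],
       [7, 6, 0, 0],
       [0, 0, 0, 0]]
new = [[0, 0, 9, 8],                    # the same 2 x 2 patch, moved two pixels right
       [0, 0, 7, 6],
       [0, 0, 0, 0]]
patch = {(0, 0), (1, 0), (0, 1), (1, 1)}
move = ([[1.0, 0.0], [0.0, 1.0]], (-2.0, 0.0), (0.0, 0.0), 1.0)   # new pixel -> old pixel
pred = predict(old, [(move, patch)], width=4, height=3)
err = prediction_error(new, pred)
```

In this idealized case the prediction is exact, so the error image is all zeros and only
the transformation parameters and the face vertices would carry information.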



Observe that the overall scheme is lossy or lossless, depending on the compression method
used for E. A schematic picture of the proposed compression scheme can be found in
Fig. 5.

It is worth noting in Fig. 5 the presence of a cache. The motivation for its introduction
is that, while browsing a virtual world, it is very common to return to an already
visited position. It is clear that in such a case the rendered image will be equal to the
image already sent to the client. In order to exploit this, it is convenient to keep a
cache where the most recently generated views are saved.

Note from Fig. 5 that the server also uses the cached images as reference images in the
compression scheme. This allows the server to use as reference any previously generated
image and not necessarily the current one.
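
A cache of recently generated views could be organized as in the following sketch. The
eviction policy (least recently used), the use of the camera pose as the key, and the
class name `ViewCache` are assumptions; the paper only states that the most recently
generated views are kept.

```python
from collections import OrderedDict


class ViewCache:
    """Keeps the most recently used views, keyed by camera pose (hypothetical key)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._views = OrderedDict()

    def get(self, pose):
        """Return the cached view for `pose`, or None; a hit marks the view as recent."""
        if pose not in self._views:
            return None
        self._views.move_to_end(pose)
        return self._views[pose]

    def put(self, pose, image) -> None:
        """Store a freshly rendered view, evicting the least recently used one if full."""
        self._views[pose] = image
        self._views.move_to_end(pose)
        if len(self._views) > self.capacity:
            self._views.popitem(last=False)


cache = ViewCache(capacity=2)
cache.put("pose-1", "view-1")
cache.put("pose-2", "view-2")
hit = cache.get("pose-1")          # the user returns to an already visited position
cache.put("pose-3", "view-3")      # the least recently used view (pose-2) is evicted
```

On a hit the server can skip rendering and compression for that view, and any cached view
can also serve as the reference image for prediction.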



3.3 The Sprite Model 

Until now we made the hypothesis that the virtual world contains only one object. Al- 
though this could be true in some applications (for example in a close-up of a statue 
during a virtual visit or in an e-commerce context), it is clear that several other applica- 
tions (e.g. video games) will have virtual worlds populated by several objects. 

The compression scheme presented in the previous sections can nevertheless still be used
by exploiting the previously introduced sprite model. Within the sprite model, each image
sent to the client is considered as a set of several independent layered images (with
non-rectangular support) called sprites. Each sprite is the rendered view of a single
object or a part of an object.

A sprite is completely described by its support and by the RGB values associated to 
points of its support. The support can be described as a polygon (for simple supports, 
like the support of the sprite relative to a check-board) or by means of a transparency 
component, also known as alpha-channel. 
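
Composing sprites described in this way can be sketched as below. The sketch assumes that
each sprite is a mapping from the pixels of its support to an RGB colour plus an alpha
value in [0, 1], and that sprites are listed from the farthest to the nearest layer; the
function name `composite` is hypothetical.

```python
def composite(width, height, sprites, background=(0, 0, 0)):
    """Paint sprites back to front; each sprite maps (x, y) -> (r, g, b, alpha)."""
    image = [[background for _ in range(width)] for _ in range(height)]
    for sprite in sprites:                      # the first sprite is the farthest layer
        for (x, y), (r, g, b, a) in sprite.items():
            if 0 <= x < width and 0 <= y < height:
                br, bg, bb = image[y][x]
                # Standard alpha blending of the sprite pixel over what is already there.
                image[y][x] = (a * r + (1 - a) * br,
                               a * g + (1 - a) * bg,
                               a * b + (1 - a) * bb)
    return image


far = {(0, 0): (255, 0, 0, 1.0), (1, 0): (255, 0, 0, 1.0)}    # opaque red, two pixels
near = {(1, 0): (0, 0, 255, 0.5)}                             # half-transparent blue
frame = composite(2, 1, [far, near])
```

Pixels outside a sprite's support are simply absent from its mapping, which is how the
non-rectangular support is represented in this sketch.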

It is worth observing that the task of composing several sprites into a single image 
is well within the capabilities of current personal computers. 

Fig. 5. Proposed compression scheme. 



4 Other Issues 

4.1 Progressive Transmission 

The prediction error can be compressed in either a lossy or a lossless way. If a lossy
scheme is chosen, it can be convenient to use a wavelet-based one such as SPIHT or EZW,
which allow for progressive transmission (i.e., a coarse approximation of the compressed
image can be reconstructed from any initial part of the resulting bit-stream). In order
to understand why this can be convenient, consider a virtual world with a moving object.
Each time the object moves, the GE must generate a new view and send it to the client. If
the object moves very fast, it could happen that the GE generates a new view before the
complete description of the previous one has been sent to the client. In applications
like network games this would lead to an unacceptable loss of synchronization between the
virtual world and the view shown to the users. However, since the view is changing
rapidly, it is reasonable to assume that the quality of the compressed image can be
lowered without the user noticing. If the bit-stream allows for progressive transmission,
this can be achieved by just stopping the transmission of the old image and beginning to
transmit the new one.
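
The "stop the old image, start the new one" policy can be sketched as a sender that
checks for a newer view between chunks. The chunking, the callback `newer_view_ready`,
and the function name are assumptions made for illustration; the bit-stream is assumed to
be embedded, so that any prefix decodes to a coarser image.

```python
def send_progressive(bitstream: bytes, chunk_size: int, newer_view_ready):
    """Yield successive chunks of an embedded bit-stream until a newer view is waiting."""
    for start in range(0, len(bitstream), chunk_size):
        if newer_view_ready():
            return          # abandon the rest; the client decodes the prefix it has
        yield bitstream[start:start + chunk_size]


stream = bytes(range(10))   # stand-in for the compressed prediction error of one view
sent = []


def ready() -> bool:
    """Toy condition: a newer view becomes available after two chunks were sent."""
    return len(sent) >= 2


for chunk in send_progressive(stream, 4, ready):
    sent.append(chunk)
```

Here only the first eight of the ten bytes are transmitted before the sender switches to
the newer view.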

4.2 Choosing the Transport Protocol and Resilience to Packet Loss 

At the moment, in our experimental implementation, we are using TCP as a transport
protocol because of its simplicity. However, the retransmission mechanism used in TCP
could introduce too high an overhead in the interaction between the user and the virtual
world. Because of this, it could be convenient to switch to a UDP-based protocol, such as
RTP. This choice, however, introduces the problem of dealing with packet losses. Since
the bit-stream created by the split-browser approach contains more precious data
(projective transform parameters and low-resolution components) and less precious data
(high-resolution error components), we plan to use priority encoded transmission.

5 Conclusions 

The construction of articulated and photo-realistic 3D virtual worlds is within reach of
current computer graphics and 3D imaging technology. The philosophy currently adopted for
the diffusion of 3D virtual worlds over the Internet via VRML and/or its current
extensions has a fundamental drawback: it calls for downloading the 3D virtual world
description to the client side. The split-browser approach solves this issue by giving an
alternative solution which transforms the problem of interacting with a virtual world
into an image transmission task.

This work addresses the crucial issue of compressing the image stream in order to allow
smooth interaction between the user and the virtual world. The proposed compression
scheme first decomposes the virtual world into a union of objects, then approximates each
object as a polyhedron and, finally, compresses the image of each face of the polyhedron
by predicting it with a projective transformation and sending the parameters of the
projective transformation together with the suitably compressed prediction error. In
order to allow an automatic tradeoff between bandwidth and image quality, the prediction
error should be compressed with a multiresolution scheme which allows for progressive
transmission (e.g. SPIHT or EZW).

Future research will address the issue of the transmission protocol to be used for the
split-browser structure and, possibly, how to deal with packet loss.



Abstract. The increasing availability of multimedia content and particularly the rapid
development of the Internet have brought about new ways to distribute information
documents. In this scenario a copy protection system that allows control over the
distribution of multimedia data is strongly required. A new technology useful for
copyright protection is watermarking: a digital code (watermark), indicating the
copyright owner, is directly embedded into the video signal. In this paper specific
attention is paid to the MPEG-4 standard, and a digital video watermarking system has
been designed in such a way that the complexity of the standard and the diversity of its
applications are considered. In particular, the possibility offered by the MPEG-4
standard of directly accessing objects within a video sequence introduces a new
constraint on the watermarking process: even if a video object is transferred from one
sequence to another, the copyright data of the single object has to be correctly
detected. Moreover, to make the watermarking system robust against format conversions,
the code has to be inserted before compression. The method proposed in this paper
satisfies the previous requirements by relying on an image watermarking algorithm which
embeds the code in the Discrete Wavelet Transform of each frame.



1 Introduction 

The ever-increasing availability of multimedia content and the rapid development 
of communication networks, in particular the Internet, have created new ways to 
distribute information. The huge growth in the number of Internet users has made it 
easy to reach many people all over the world, and a new market has emerged in the 
last few years. To develop effectively, however, this trade needs to guarantee 
security during transactions, so that the participating partners are not afraid of 
doing business on the net. In this paper the specific case of video coded according 
to the MPEG-4 standard is considered, and the problems associated with this medium 
are investigated. The MPEG-4 standard [1] is very attractive for a large set of 
applications such as video editing, Internet video distribution, and wireless video 
communications. Each of these applications has a set of requirements regarding 
protection of the information it manages. Hence a copy protection system that limits 
the duplication of multimedia data and makes broadcast monitoring possible is 
considered a mandatory requirement by multimedia data owners [2]. A new technology 
that seems useful for copyright protection is watermarking: a watermarking system 
embeds a digital code (watermark), which can be used to indicate the copyright owner, 
directly into the video signal. 

A digital video watermarking system has to be designed so that some basic 
requirements are satisfied: the embedded watermark should be perceptually invisible; 
the copyright information should be robust against processing that does not seriously 
degrade the quality of the image; and the false positive rate should be extremely 
low. Finally, the embedded watermark should be robust against attacks such as cutting 
one or more frames from the video. To resist this kind of attack it is necessary to 
insert the copyright information continuously throughout the video sequence. 

In general, existing video watermarking methods have been conceived to work with 
MPEG-1 or MPEG-2 streams. These algorithms can be divided into two main classes 
according to the type of content the watermark is applied to: raw-video watermarking 
algorithms, which add the watermark before compression, and bit-stream watermarking 
systems, which embed the code after compression. 

A digital video watermarking system for MPEG-4 video has to take into account the 
complexity of the standard and the diversity of its applications. In particular, the 
main difference from previous video standards such as MPEG-1 and MPEG-2 is that 
MPEG-4 coding is content-based, i.e. single objects are coded individually. Each 
frame of an input sequence is segmented into a number of arbitrarily shaped regions, 
called Video Object Planes (VOPs), and the shape, motion and texture information of 
the VOPs belonging to the same Video Object (VO) are coded into a separate Video 
Object Layer (VOL). Hence, the ability of the MPEG-4 standard to directly access and 
manipulate objects within a video sequence introduces a new constraint on the 
watermarking process. An object watermarking system has to be designed in such a way 
that, even if a video object is transferred from one sequence to another, it is still 
possible to correctly detect the copyright data relating to the single object. 

A watermarking technique designed for MPEG-4 video streams has already been proposed. 
This algorithm embeds a watermark in each video object of an MPEG-4 coded video 
bit-stream by imposing specific relationships between some predefined pairs of 
quantized DCT middle-frequency coefficients in the luminance blocks of 
pseudo-randomly selected macroblocks. The quantized coefficients are recovered from 
the MPEG-4 bit-stream, modified to embed the watermark, and then encoded again. The 
main drawback of this technique is that, since the code is directly embedded into the 
compressed MPEG-4 bit-stream, the copyright information is lost if the video file is 
converted to a different compression standard, such as MPEG-2. In order to be robust 
against format conversions, the watermark has to be inserted before compression, i.e. 
frame by frame. The problem is that if a Video Object (VO) of the scene is very 
small, it is very difficult to embed into it a watermark that is robust to MPEG-4 
compression. The method proposed in this paper satisfies the previous requirements: 
it belongs to the category of raw-video watermarking algorithms, since it operates 
frame by frame, casting a different watermark in each video object of an MPEG-4 coded 
video bit-stream. Watermarking relies on an image watermarking algorithm which embeds 
the code in the Discrete Wavelet Transform (DWT) domain. The proposed system is 
invariant to conversion to a different compression standard, as described later, and 
it is able to detect a watermark even in very small regions of an image. 

Fig. 1. Watermark casting process 

2 Watermark Embedding 

To embed the watermarks, the MPEG-4 coded video bit-stream is decoded, obtaining a 
sequence of frames. The procedure described in Figure 1 is then applied frame by 
frame. The objects contained in the frame are extracted, obtaining a separate image 
for each object; a different code is embedded in each image by means of the image 
watermarking system summarized in the following. 

The image to be watermarked is first decomposed through a four-level DWT: let 
$I_j^{\theta}$ denote the sub-band at resolution level $j = 0, 1, 2, 3$ with 
orientation $\theta \in \{LL, LH, HL, HH\}$. The watermark, consisting of a 
pseudo-random binary sequence, is inserted by modifying the wavelet coefficients 
belonging to the three detail bands at level 0, i.e. $I_0^{LH}$, $I_0^{HL}$, and 
$I_0^{HH}$. This choice is motivated by experimental tests showing that it offers the 
best compromise between robustness and invisibility. Before being added to the DWT 
values, each binary value is multiplied by a weighting parameter obtained from a 
noise sensitivity function. In this way the maximum tolerable level of disturbance 
(i.e. watermark coefficient) is added to each DWT coefficient. The construction of 
the sensitivity function is mainly based on the analysis of the degree of image 
activity in the neighborhood of the pixel to be modified. 
In particular, given a code sequence $x_i \in \{+1, -1\}$, with 
$i = 0, \ldots, 3MN - 1$, where $2M \times 2N$ is the image size, the three detail 
sub-bands are modified in this way: 

\[
\begin{aligned}
\tilde{I}_0^{LH}(i,j) &= I_0^{LH}(i,j) + \alpha\, w^{LH}(i,j)\, x_{iN+j} \\
\tilde{I}_0^{HL}(i,j) &= I_0^{HL}(i,j) + \alpha\, w^{HL}(i,j)\, x_{MN+iN+j} \\
\tilde{I}_0^{HH}(i,j) &= I_0^{HH}(i,j) + \alpha\, w^{HH}(i,j)\, x_{2MN+iN+j}
\end{aligned}
\tag{1}
\]

where $\alpha$ is a parameter accounting for watermark strength and 
$w^{\theta}(i,j)$ is a weighting function taking into account the local sensitivity 
of the image to noise. 
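
The embedding rule of equation (1) can be illustrated with a minimal sketch. Several things here are assumptions rather than details from the paper: the Haar filter (the wavelet actually used is not specified here), the labelling of the LH and HL bands (conventions differ), the use of a single decomposition level (enough to reach the level-0 detail bands that are modified), and the default uniform weights standing in for the noise sensitivity function. The names `haar_dwt2`, `haar_idwt2`, and `embed` are hypothetical.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_dwt2(img):
    """One-level 2-D Haar DWT of an image with even height and width.

    Returns (LL, LH, HL, HH), each of size M x N for a 2M x 2N input.
    The LH/HL labelling is a convention chosen for this sketch.
    """
    img = np.asarray(img, dtype=float)
    a = (img[0::2, :] + img[1::2, :]) / SQRT2   # sums of adjacent rows
    d = (img[0::2, :] - img[1::2, :]) / SQRT2   # differences of adjacent rows
    ll = (a[:, 0::2] + a[:, 1::2]) / SQRT2
    lh = (a[:, 0::2] - a[:, 1::2]) / SQRT2
    hl = (d[:, 0::2] + d[:, 1::2]) / SQRT2
    hh = (d[:, 0::2] - d[:, 1::2]) / SQRT2
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2."""
    m, n = ll.shape
    a = np.empty((m, 2 * n))
    d = np.empty((m, 2 * n))
    a[:, 0::2], a[:, 1::2] = (ll + lh) / SQRT2, (ll - lh) / SQRT2
    d[:, 0::2], d[:, 1::2] = (hl + hh) / SQRT2, (hl - hh) / SQRT2
    img = np.empty((2 * m, 2 * n))
    img[0::2, :], img[1::2, :] = (a + d) / SQRT2, (a - d) / SQRT2
    return img

def embed(img, x, alpha, weights=None):
    """Add alpha * w(i,j) * x to the three level-0 detail bands, as in eq. (1).

    x       : sequence of 3*M*N values in {+1, -1}
    weights : optional (w_LH, w_HL, w_HH) arrays; uniform weights are a
              placeholder for the noise sensitivity function
    """
    ll, lh, hl, hh = haar_dwt2(img)
    m, n = lh.shape
    x = np.asarray(x, dtype=float)
    if x.size != 3 * m * n:
        raise ValueError("code length must be 3*M*N")
    if weights is None:
        weights = (np.ones((m, n)),) * 3
    # Row-major reshape maps element (i, j) to index i*N + j.
    lh = lh + alpha * weights[0] * x[: m * n].reshape(m, n)
    hl = hl + alpha * weights[1] * x[m * n : 2 * m * n].reshape(m, n)
    hh = hh + alpha * weights[2] * x[2 * m * n :].reshape(m, n)
    return haar_idwt2(ll, lh, hl, hh)
```

In a real system the weights would come from the local image activity, so that stronger marks are placed where they are less visible.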

The inverse DWT is then computed. The watermarked images are composed together in 
order to obtain the frame containing the copyright information for all of the 
objects considered. When all the frames have been marked, the sequence is compressed, 
obtaining the watermarked MPEG-4 coded bit-stream. 

3 Watermark Detection 

In this section the process of watermark detection is analysed. The watermarked 
MPEG-4 coded video bit-stream is decoded, obtaining a sequence of frames. Once again, 
the objects present in the scene are extracted frame by frame, obtaining a different 
image for each object. The DWT of each image is then computed; next, the code 
corresponding to the object is detected by means of the correlation between the 
watermark and the marked DWT coefficients. The value of the correlation is compared 
to a threshold to decide whether the watermark is present or not. An optimum 
threshold has been theoretically set to minimize the probability of false positive 
detection. In particular, the value of this threshold depends on the variance of the 
DWT coefficients of the watermarked image, and can thus be computed a posteriori, 
without any knowledge of the original frame or of the watermark embedding process. 
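
A minimal sketch of a correlation detector of this kind is given below. The threshold used here is not the paper's optimum threshold, whose expression is not given in this section; it is a stand-in based on a Gaussian approximation. If the tested code is an i.i.d. ±1 sequence independent of the coefficients, the correlation has zero mean and variance equal to the sum of the squared coefficients divided by K², so a threshold of `z` standard deviations can be computed from the marked coefficients alone. The name `detect` and the default `z = 4.0` are hypothetical.

```python
import numpy as np

def detect(coeffs, x, z=4.0):
    """Correlate marked detail coefficients with a candidate code.

    coeffs : the (possibly) watermarked DWT detail coefficients, any shape
    x      : candidate code in {+1, -1}, same number of elements
    z      : threshold in standard deviations of the no-watermark response
             (an illustrative choice, not the paper's optimum threshold)

    Returns (rho, threshold, present).
    """
    c = np.asarray(coeffs, dtype=float).ravel()
    x = np.asarray(x, dtype=float).ravel()
    if c.size != x.size:
        raise ValueError("coefficients and code must have the same length")
    k = c.size
    rho = float(np.dot(c, x)) / k
    # Standard deviation of rho when x is independent of the coefficients;
    # it is computed a posteriori from the marked coefficients only.
    threshold = z * float(np.sqrt(np.sum(c ** 2))) / k
    return rho, threshold, bool(rho > threshold)
```

Running `detect` against many randomly generated codes, as in the experiments reported later, gives one response per candidate; only the embedded code should exceed the threshold.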

Since the detection process is carried out frame by frame, the watermark embedded 
into a video object can still be detected if the VO is transferred from one sequence 
to another. However, since the DWT is not invariant to translation, if the video 
object is placed in a different position in the new scene, the synchronization 
between the watermark and the VO is lost. To cope with this, the watermark detector 
needs to compute the correlation for all shifts of the frame. To reduce the 
computational complexity, the two-dimensional Fourier Transform can be used: the 
position of the peak obtained through the Fourier Transform indicates the 
translation of the VO required to recover the synchronization before performing 
watermark detection. 
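
One common way to evaluate a correlation for all shifts at once is FFT-based circular cross-correlation. The paper does not spell out its exact procedure, so the sketch below is only an illustrative assumption: it recovers the cyclic translation between a reference array and a shifted copy from the position of the correlation peak. The name `find_shift` is hypothetical.

```python
import numpy as np

def find_shift(reference, observed):
    """Estimate the cyclic shift (dy, dx) such that
    observed ~ np.roll(reference, (dy, dx), axis=(0, 1)).

    The circular cross-correlation for every shift is obtained with three
    2-D FFTs instead of one explicit correlation per shift.
    """
    f_ref = np.fft.fft2(np.asarray(reference, dtype=float))
    f_obs = np.fft.fft2(np.asarray(observed, dtype=float))
    # corr[s] = sum over n of observed[n] * reference[n - s]
    corr = np.fft.ifft2(f_obs * np.conj(f_ref)).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    return int(dy), int(dx)
```

Once the shift is known, the object can be moved back to its original position and the correlation detector applied a single time, rather than once per candidate shift.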

4 Experimental Results 

The proposed watermarking algorithm has been tested on different video sequences. 
Here, the results concerning the CIF sequence Stefan are shown. Each frame is 
composed of two different objects, the background (Video Object 1) and the player 
(Video Object 0). We introduced a different random sequence into each object, as in 
Figure 1, and after watermarking the sequence was compressed, obtaining an MPEG-4 
coded video bit-stream with a rate of 500 Kb/s per Video Object Layer (VOL). The 
video stream was then decompressed and in each frame the two objects were separated, 
obtaining two different images to which the detection process was applied. Detection 
results are shown in Figure 2(a) for VO 0 and Figure 2(b) for VO 1; the response of 
the watermark detector to 200 randomly generated marks shows that the response to the 
embedded watermark sequence (i.e. no. 10 for VO 0, no. 20 for VO 1) is much larger 
than the response to the others and is higher than the detection threshold. This 
result is interesting, since the player is a very small object that was heavily 
compressed after watermark embedding. Moreover, the detection process was also 
applied to the complete frame, without selecting the objects: as demonstrated in 
Figure 2(c), the two watermarks embedded in the two objects are easily detected. Note 
that the correlation peak of the tennis player (VO 0) is lower than the response of 
the background (VO 1), because only a low watermark energy can be embedded into the 
tennis player. The correct detection of the two watermarks indicates that the system 
is robust to conversion from MPEG-4 to MPEG-2, since the two watermarks are revealed 
in the frame, even if, clearly, in this case it is no longer possible to associate 
each watermark with the corresponding Video Object. 

A further test was carried out by introducing into the video sequence Flowers, 
previously watermarked with code number 100 (see Figure 3(a)), the Video Object 
represented by the player (Figure 3(b)), obtaining a new sequence where Stefan 
appears to be playing in the flower garden (Figure 3(c)). We applied the watermark 
detector to this new sequence, obtaining the results shown in Figure 4: again, the 
two objects are clearly detected, even if, as already explained, the correlation peak 
of the player is lower than the peak of the background. 

Fig. 2. Watermark detection response relating to the VO 0 (a), the VO 1 (b), and the frame (c). 

5 Conclusions 

The MPEG-4 standard is proving very attractive for a wide variety of multimedia 
applications, where a copy protection system that limits the duplication of 
multimedia data and makes broadcast monitoring possible is required. A new technology 
that seems useful for copyright protection is watermarking. The complexity of the 
MPEG-4 standard and the diversity of its applications introduce new constraints on 
the watermarking process: in particular, the ability of the MPEG-4 standard to 
directly access and manipulate objects within a video sequence implies that, even if 
a video object is transferred from one sequence to another, the copyright data of the 
single object has to be correctly detected. Another important requirement is that, in 
order to be robust against format conversions, the watermark has to be inserted 
before compression, i.e. frame by frame. 

Fig. 3. A frame of the video sequence Flowers, previously watermarked (a), the Video Object 
represented by the player (b), and the new sequence where Stefan appears to be playing in the 
flower garden (c). 

Fig. 4. Detection results of the new video sequence represented in Figure 3(c). 

The method proposed in this paper satisfies the previous requirements; it belongs to 
the category of raw-video watermarking algorithms, since it operates frame by frame, 
casting a different watermark in each video object of an MPEG-4 coded video 
bit-stream. Watermarking relies on an image watermarking algorithm which embeds the 
code in the Discrete Wavelet Transform (DWT) domain. Future work will be dedicated to 
testing the system against a larger set of attacks. 